All the Perl that's Practical to Extract and Report

  • s/(?<![\xC2\xC3])([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg; # If chars > 0xFF extend appropriately.
    • Yeah, that’s an option. Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good. I’m also unsure that Latin-1 codepoints correspond 1:1 to Unicode codepoints. But yeah, I get your point.

      It would just take a lot of concentrated effort to ensure 100% correctness when taking that route, and I couldn’t be bothered with the bitfiddle this time (unlike that other time when I wrote codepoint-to-UTF-8 math in XPath within XSLT [plasmasturm.org] of all things). That code up there took 3 minutes to write once I found the right fallback in the Encode docs, and I know it’s correct.

      But I might do it the hard way anyway at some other time.
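A minimal sketch of what an Encode-based approach could look like, since the code referred to above is not shown in this thread. It is not the poster's code: it swaps in Encode's `FB_QUIET` check rather than whichever fallback the post used, and `decode_mixed` is a hypothetical name. The idea is to decode as much valid UTF-8 as possible and read each byte that stops the decoder as Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): decode a byte string that is
# mostly UTF-8 but may contain stray Latin-1 bytes.
sub decode_mixed {
    my ($bytes) = @_;
    my $text = '';
    while (length $bytes) {
        # With FB_QUIET, decode() returns the longest valid UTF-8 prefix
        # and leaves the unprocessed remainder in $bytes.
        $text .= decode('UTF-8', $bytes, Encode::FB_QUIET());

        # Anything left starts with a byte that is not valid UTF-8 at this
        # position. Read that single byte as Latin-1, whose code points
        # equal its byte values, and carry on.
        $text .= chr(ord(substr($bytes, 0, 1, ''))) if length $bytes;
    }
    return $text;
}

# Mixed input: UTF-8 "é", a bare Latin-1 "ï", and a UTF-8 curly quote.
my $text = decode_mixed("caf\xC3\xA9 na\xEFve \xE2\x80\x99");
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $text), "\n";
```

This leans on the library's own UTF-8 validation instead of a hand-written pattern, so code points beyond U+00FF come through without extra work.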

      • "Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good."

        My regex replaces each high-bit-set byte with its two-byte UTF-8 encoding, unless the byte follows 0xC2 or 0xC3 and could therefore be the trailing byte of a validly UTF-8 encoded high-bit Latin-1 character. What is invalid and what is not depends on your definition: is this

        0xC3 0xBF

        meant to be interpreted as a valid UTF-8 encoding of the one character U+00FF

        LATIN SMALL LETTER Y WITH DIAERESIS

        or as two characters

        LATIN CAPITAL LETTER A WITH TILDE (U+00C3 == 0x
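The ambiguity can be made concrete by decoding the same two bytes both ways. A small sketch, assuming the core Encode module; `codepoints` is a hypothetical helper used only for printing.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render a character string as a list of code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $bytes = "\xC3\xBF";

# Reading 1: the bytes are already UTF-8, so they form one character.
my $as_utf8 = decode('UTF-8', $bytes);

# Reading 2: the bytes are Latin-1, so each byte is its own character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

print 'UTF-8:   ', codepoints($as_utf8),   "\n";   # U+00FF
print 'Latin-1: ', codepoints($as_latin1), "\n";   # U+00C3 U+00BF
```

Both readings are well-formed, so nothing in the bytes themselves says which one was meant; a converter has to pick a policy.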
        • "If you might also have code points beyond U+00FF in your data, then I fully recommend the Encode way; the regex grows too cumbersome, or at least too ugly. This depends on whether your users have figured out how to input those fancy characters :-)"

          Ah, so that’s the assumption in your regex that I was vaguely aware of. Indeed, I cannot ignore codepoints beyond U+00FF. In particular, Unicode curly quotes (U+2019, U+201C, U+201D) and en- and em-dashes (U+2013, U+2014) are ubiquitous (and not at all ha
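To show why the bit-fiddling route grows once code points pass U+00FF, here is a hedged sketch of the arithmetic extended to three-byte sequences, checked against Encode. `utf8_bytes` is a hypothetical helper, not code from the thread, and it deliberately stops at U+FFFF and ignores surrogates.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 bytes for one code point up to U+FFFF.
# Surrogates and code points above U+FFFF are not handled in this sketch.
sub utf8_bytes {
    my ($cp) = @_;
    return chr($cp) if $cp < 0x80;                           # 1 byte
    return chr(0xC0 | ($cp >> 6))
         . chr(0x80 | ($cp & 0x3F)) if $cp < 0x800;          # 2 bytes
    return chr(0xE0 | ($cp >> 12))                           # 3 bytes
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

# Plain ASCII, a Latin-1 character, then the dashes and curly quotes
# mentioned above; each result is compared with Encode's own output.
for my $cp (0x41, 0xFF, 0x2013, 0x2014, 0x2019, 0x201C, 0x201D) {
    my $mine = utf8_bytes($cp);
    die sprintf('mismatch at U+%04X', $cp)
        unless $mine eq encode('UTF-8', chr($cp));
    printf "U+%04X -> %s\n", $cp,
        join ' ', map { sprintf '%02X', ord($_) } split //, $mine;
}
```

The two-byte branch is the same arithmetic as the one-liner at the top of the thread. The curly quote U+2019 needs the three-byte branch and comes out as E2 80 99.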