  • # Encode each high-bit byte as UTF-8 unless it follows 0xC2/0xC3, i.e. unless it looks like the trailing byte of an already-encoded sequence:
    s/(?<![\xC2\xC3])([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg; # If chars > 0xFF extend appropriately.
    • Yeah, that’s an option. Not sure your regex is reliable enough, though; I think some invalid sequences can slip through that, which is no good. I’m also unsure that Latin-1 codepoints correspond 1:1 to Unicode codepoints. But yeah, I get your point.

      It would just take a lot of concentrated effort to ensure 100% correctness when taking that route, and I couldn’t be bothered with the bit-fiddling this time (unlike that other time, when I wrote codepoint-to-UTF-8 math in XPath within XSLT [plasmasturm.org], of all things).

      • Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good.

        My regex replaces every high-bit-set byte that cannot be the trailing byte of a validly UTF-8-encoded Latin-1 high-bit character with its UTF-8 byte sequence. What counts as invalid depends on your definition: is this

        0xC3 0xBF

        meant to be interpreted as a valid UTF-8 encoding of the one character U+00FF

        LATIN SMALL LETTER Y WITH DIAERESIS

        or as two characters

        LATIN CAPITAL LETTER A WITH TILDE (U+00C3 == 0xC3 in Latin-1)

        followed by

        INVERTED QUESTION MARK (U+00BF == 0xBF in Latin-1)?
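
        A quick illustration of the two readings, using the core Encode module (the snippet and its variable names are mine, not from either post):

          use Encode qw(decode);

          my $bytes = "\xC3\xBF";

          # Reading 1: the bytes are UTF-8 -- one character, U+00FF.
          my $utf8 = decode('UTF-8', $bytes);
          printf "as UTF-8:   %d char  (%vX)\n", length $utf8, $utf8;      # 1 char  (FF)

          # Reading 2: the bytes are Latin-1 -- two characters, U+00C3 U+00BF.
          my $latin1 = decode('ISO-8859-1', $bytes);
          printf "as Latin-1: %d chars (%vX)\n", length $latin1, $latin1;  # 2 chars (C3.BF)
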
        • If you might also have code points beyond U+00FF in your data, then I fully recommend the Encode way; the regex grows too cumbersome, or at least too ugly. This depends on whether your users have figured out how to input those fancy characters :-)

          Ah, so that’s the assumption in your regex that I was vaguely aware of. Indeed, I cannot ignore codepoints beyond U+00FF. In particular, Unicode curly quotes (U+2019, U+201C, U+201D) and en- and em-dashes (U+2013, U+2014) are ubiquitous (and not at all hard for users to type), so I must assume that my UTF-8 data will contain many more high-bit-set byte values than just 0xC2/0xC3 that are still part of valid multibyte sequences.
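
          For instance, the curly apostrophe alone encodes to three bytes, none of them 0xC2 or 0xC3 (an illustrative snippet, mine):

            use Encode qw(encode);

            # U+2019 RIGHT SINGLE QUOTATION MARK becomes three UTF-8 bytes.
            printf "%vX\n", encode('UTF-8', "\x{2019}");   # prints E2.80.99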

          As for your other question:

          is 0xC3 0xBF meant to be interpreted as a valid UTF-8 encoding of the one character U+00FF or as two characters U+00C3 followed by U+00BF

          It is to be interpreted as the UTF-8 encoding of U+00FF, as implied by my stated working assumption that the primary encoding is UTF-8. In fact, this is the only direction that makes sense: every byte sequence is valid Latin-1, so there is no such thing as an invalid Latin-1-encoded sequence. You must assume UTF-8 so that invalid sequences can exist at all; any sequence that fails to decode as UTF-8 must therefore consist of Latin-1 bytes.

          That’s the entire point of my post, actually.
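
          In code, the idea might look like this (a minimal sketch using the core Encode module; the function name and the whole-string fallback are my own assumptions, and the actual algorithm may well fall back at a finer granularity):

            use Encode qw(decode encode FB_CROAK LEAVE_SRC);

            # Try strict UTF-8 first; if the octets fail to decode, conclude
            # they are Latin-1 bytes and decode them as such instead.
            sub repair_to_utf8 {
                my ($octets) = @_;
                my $chars = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
                $chars = decode('ISO-8859-1', $octets) unless defined $chars;
                return encode('UTF-8', $chars);
            }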

          I’ve not tested the algorithm very widely, but so far it has been accurate with whatever data I’ve thrown at it.

          Note that UTF-8 was purposefully engineered so that legal UTF-8 is highly unlikely (based on statistical sampling of existing legacy 8-bit texts) to be valid/sensible legacy 8-bit text.

          Oh, I know. It’s a marvel of design. Variable-width encodings suffer some inherent suckage, but UTF-8 pays this price in return for huge gains. It’s astonishingly clever and beautiful.