NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

  • s/(?<![\xC2\xC3])([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg; # If chars > 0xFF extend appropriately.
    • Yeah, that’s an option. Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good. I’m also unsure that Latin-1 codepoints correspond 1:1 to Unicode codepoints. But yeah, I get your point.

      It would just take a lot of concentrated effort to ensure 100% correctness when taking that route, and I couldn’t be bothered with the bit-fiddling this time (unlike that other time, when I wrote codepoint-to-UTF-8 math in XPath within XSLT [plasmasturm.org], of all things).
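For readers following the bit-fiddling, here is a minimal sketch of the two-byte arithmetic that the one-liner at the top of the thread relies on. It only covers code points U+0080..U+00FF, and `latin1_byte_to_utf8` and `hex_bytes` are hypothetical helper names introduced for illustration, not code from either poster.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one Latin-1 byte as UTF-8.
# Assumes the input is a single byte (0x00..0xFF), nothing wider.
sub latin1_byte_to_utf8 {
    my ($byte) = @_;
    my $cp = ord $byte;
    return $byte if $cp < 0x80;          # ASCII is unchanged in UTF-8
    return chr(0xC0 | ($cp >> 6))        # lead byte: 110xxxxx (top 2 bits)
         . chr(0x80 | ($cp & 0x3F));     # continuation byte: 10xxxxxx (low 6 bits)
}

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', $_) } unpack('C*', $s);
}

print hex_bytes(latin1_byte_to_utf8("\xFF")), "\n";   # prints C3 BF
print hex_bytes(latin1_byte_to_utf8("\xE9")), "\n";   # prints C3 A9
```

Because every code point in this range fits in 8 bits, the lead byte can only ever be 0xC2 or 0xC3, which is why the one-liner's lookbehind checks for exactly those two bytes.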

      • : Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good.

        My regex replaces each high-bit-set byte that is not preceded by 0xC2 or 0xC3 (that is, each byte that cannot be the trailing byte of a UTF-8-encoded Latin-1 high-bit character) with its two-byte UTF-8 encoding. What is invalid and what is not depends on your definition: is this

        0xC3 0xBF

        meant to be interpreted as a valid UTF-8 encoding of the one character U+00FF

        LATIN SMALL LETTER Y WITH DIAERESIS

        or as two characters

        LATIN CAPITAL LETTER A WITH TILDE (U+00C3 == 0xC3)

        followed by

        INVERTED QUESTION MARK (U+00BF == 0xBF)?

        In other words, which interpretation gets to go first?

        Note that UTF-8 was purposefully engineered so that legal UTF-8 is highly unlikely (based on statistical sampling of existing legacy 8-bit texts) to be valid/sensible legacy 8-bit text.

        Since your input data is essentially corrupt (it is full of invalid UTF-8 sequences), you are going to end up with a best-guess strategy whichever way you choose to go.

        If you might also have code points beyond U+00FF in your data, then I fully recommend the Encode way; the regex grows too cumbersome, or at least too ugly. This depends on whether your users have figured out how to input those fancy characters :-)

        : I’m also unsure that Latin-1 codepoints correspond 1:1 to Unicode codepoints.

        Latin-1 codepoints 0x00..0xff do correspond to Unicode codepoints U+0000..U+00FF 1:1, 100%, completely, fully, without doubt.
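One possible shape for the best-guess strategy discussed above, as a sketch only: let well-formed UTF-8 sequences "go first" and leave them alone, then treat every remaining high-bit byte as Latin-1 and re-encode it with Encode. The thread does not show what "the Encode way" looked like, so `fix_mixed`, `hex_bytes` and `$utf8_seq` are hypothetical names, and the pattern follows the well-formed byte ranges from the Unicode standard rather than anything either poster wrote.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One well-formed multi-byte UTF-8 sequence (no overlongs, no surrogates,
# nothing above U+10FFFF).
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: takes a byte string, returns a byte string in UTF-8.
sub fix_mixed {
    my ($bytes) = @_;
    # At each high-bit byte, try the UTF-8 reading first; only if that
    # fails is the single byte treated as Latin-1 and re-encoded.
    $bytes =~ s{($utf8_seq)|([\x80-\xFF])}
               { defined $1 ? $1 : encode('UTF-8', decode('ISO-8859-1', "$2")) }ge;
    return $bytes;
}

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', $_) } unpack('C*', $s);
}

# A stray Latin-1 e-acute, then the ambiguous pair 0xC3 0xBF.
print hex_bytes(fix_mixed("caf\xE9 \xC3\xBF")), "\n";
# prints 63 61 66 C3 A9 20 C3 BF
```

Here 0xC3 0xBF is kept as U+00FF rather than read as two Latin-1 characters, which is exactly the "who goes first" choice described above; swapping the priority would require a different rule. Multi-byte sequences such as the three bytes of U+2019 pass through untouched.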

        • : If you might also have code points beyond U+00FF in your data, then I fully recommend the Encode way; the regex grows too cumbersome, or at least too ugly. This depends on whether your users have figured out how to input those fancy characters :-)

          Ah, so that’s the assumption in your regex that I was vaguely aware of. Indeed, I cannot ignore codepoints beyond U+00FF. In particular, Unicode curly quotes (U+2019, U+201C, U+201D) and en- and em-dashes (U+2013, U+2014) are ubiquitous (and not at all ha