Stories, comments, journals, and other submissions on use Perl; are Copyright 1998-2006, their respective owners.
or go for the jugular (Score:2)
Re:or go for the jugular (Score:1)
Yeah, that’s an option. Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good. I’m also unsure that Latin-1 codepoints correspond 1:1 to Unicode codepoints. But yeah, I get your point.
It would just take a lot of concentrated effort to ensure 100% correctness when taking that route, and I couldn’t be bothered with the bit-fiddling this time (unlike that other time, when I wrote codepoint-to-UTF-8 math in XPath within XSLT [plasmasturm.org], of all things).
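For readers wondering what that bit-fiddling involves, here is a minimal sketch in Python (not the XPath version mentioned above, and not anyone's code from this thread). The function name `encode_utf8` is made up for illustration. It also touches on the Latin-1 question: ISO-8859-1 byte values coincide with codepoints U+0000 through U+00FF, so a Latin-1 byte can be fed to the function as a codepoint directly.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 using only shifts and masks.

    Hypothetical helper for illustration; Python's own str.encode("utf-8")
    does the same job and is what real code should use.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The Latin-1 byte 0xFF, taken as codepoint U+00FF, becomes two bytes.
print(encode_utf8(0xFF).hex())    # c3bf
# A curly apostrophe (U+2019) needs three bytes.
print(encode_utf8(0x2019).hex())  # e28099
```

The four branches are the whole of the math; the fiddly part in practice is deciding which input bytes to run it on.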
Re:or go for the jugular (Score:2)
My regex takes the high-bit-set bytes that cannot be the trailing bytes of validly UTF-8-encoded Latin-1 high-bit characters and replaces them with their UTF-8 encodings. What counts as invalid depends on your definition: is this
0xC3 0xBF
meant to be interpreted as a valid UTF-8 encoding of the one character U+00FF
LATIN SMALL LETTER Y WITH DIAERESIS
or as two characters
LATIN CAPITAL LETTER A WITH TILDE (U+00C3 == 0x
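The ambiguity described in this comment can be reproduced directly. The following is a small Python sketch, with a helper name (`character_names`) invented for illustration; it decodes the same two bytes under each assumption and prints the Unicode names of the resulting characters.

```python
import unicodedata


def character_names(data: bytes, encoding: str) -> list:
    """Decode `data` with `encoding` and return each character's Unicode name."""
    return [unicodedata.name(ch) for ch in data.decode(encoding)]


pair = bytes([0xC3, 0xBF])

# Read as UTF-8, the two bytes form a single character.
print(character_names(pair, "utf-8"))
# ['LATIN SMALL LETTER Y WITH DIAERESIS']

# Read as Latin-1, each byte is a character of its own.
print(character_names(pair, "latin-1"))
# ['LATIN CAPITAL LETTER A WITH TILDE', 'INVERTED QUESTION MARK']
```

Both decodings succeed, so nothing in the bytes themselves settles the question; a repair routine has to pick one reading as policy.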
Re:or go for the jugular (Score:1)
Ah, so that’s the assumption in your regex that I was vaguely aware of. Indeed, I cannot ignore codepoints beyond U+00FF. In particular, Unicode curly quotes (U+2019, U+201C, U+201D) and en- and em-dashes (U+2013, U+2014) are ubiquitous (and not at all ha
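One way to keep codepoints beyond U+00FF intact while still repairing stray Latin-1 bytes is sketched below in Python. This is not the regex discussed in this thread, which is not shown here; it is an illustration under stated assumptions. The names `_TOKEN` and `repair` are made up, the byte ranges follow the Unicode standard's table of well-formed UTF-8 sequences, and the policy is an assumption: anything that parses as well-formed UTF-8 is taken to be UTF-8, and any other high-bit byte is taken to be Latin-1.

```python
import re

# Either one well-formed UTF-8 sequence (group "ok") or a single
# high-bit byte that is not part of one (group "stray").
_TOKEN = re.compile(
    rb"""
    (?P<ok>
        [\x00-\x7F]
      | [\xC2-\xDF][\x80-\xBF]
      | \xE0[\xA0-\xBF][\x80-\xBF]
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
      | \xED[\x80-\x9F][\x80-\xBF]
      | \xF0[\x90-\xBF][\x80-\xBF]{2}
      | [\xF1-\xF3][\x80-\xBF]{3}
      | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )
    | (?P<stray>[\x80-\xFF])
    """,
    re.VERBOSE,
)


def repair(data: bytes) -> bytes:
    """Re-encode stray Latin-1 bytes as UTF-8, leaving valid UTF-8 untouched."""
    def fix(match):
        if match.group("ok") is not None:
            return match.group("ok")
        # Assumption: a stray high-bit byte is a Latin-1 character.
        return match.group("stray").decode("latin-1").encode("utf-8")

    return _TOKEN.sub(fix, data)


# A Latin-1 e-acute (0xE9) next to a properly encoded U+2019.
mixed = b"caf\xe9 \xe2\x80\x99"
print(repair(mixed).decode("utf-8"))
```

The sketch inherits the ambiguity raised earlier in the thread: Latin-1 text that happens to form a well-formed UTF-8 sequence, such as 0xC3 0xBF, is left alone rather than re-encoded. Whether that is acceptable depends on the data.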