Journal of Aristotle (5147)

Thursday April 06, 2006
02:54 PM

Repairing broken documents that mix UTF-8 and ISO-8859-1

A perpetual (if thankfully not too frequent) problem on the web is documents that claim to be encoded in either UTF-8 or ISO-8859-1 but contain characters encoded according to the other charset. Such documents display incorrectly no matter which way you look at them. Worse, if the document in question is XML (such as, say, a newsfeed) and claims to be encoded in UTF-8, the mix-up leads the XML parser to halt and catch fire as soon as it encounters the first invalid byte.

How does the parser know? Because UTF-8 has a very specific way of encoding non-ASCII characters. Bytes that encode non-ASCII characters according to ISO-8859-1 violate this scheme, so their presence can be detected with a very high degree of confidence.
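
For illustration, here is a quick standalone snippet (not part of the repair script below; the sample word and bytes are just an invented example) showing a strict decoder rejecting a lone ISO-8859-1 byte while happily accepting the equivalent UTF-8 sequence:

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( decode FB_CROAK );

my $latin1 = "caf\xE9";        # "café" with é as the single Latin-1 byte 0xE9
my $utf8   = "caf\xC3\xA9";    # "café" with é as the UTF-8 sequence 0xC3 0xA9

# A lone 0xE9 can never occur in well-formed UTF-8, so this croaks:
eval { decode( "utf-8", $latin1, FB_CROAK ) };
print $@ ? "rejected: $@" : "unexpectedly accepted\n";

# The two-byte sequence decodes without complaint:
print "accepted: ", decode( "utf-8", $utf8, FB_CROAK ), "\n";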

Of course, the same property can just as easily be turned to good advantage. If you start from the working assumption that the primary encoding of such a confusedly encoded document is UTF-8 and simply decode and re-encode the byte stream, you can salvage the misencoded data by catching any character decoding errors and decoding the offending invalid bytes as ISO-8859-1.

Here’s a Perl script, cleverly called repair-utf8, which implements this approach:

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( decode FB_QUIET );

binmode STDIN, ':bytes';   # read raw bytes
binmode STDOUT, ':utf8';   # write well-formed UTF-8

my $out;

while ( <> ) {
    $out = '';
    while ( length ) {
        # Decode as much valid UTF-8 as possible; with FB_QUIET, Encode
        # removes the successfully decoded prefix from $_ and stops quietly
        # at the first invalid byte.
        $out .= decode( "utf-8", $_, FB_QUIET );
        # Whatever remains starts with an invalid byte: decode just that
        # one byte as ISO-8859-1, then try UTF-8 again on the rest.
        $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
    }
    print $out;
}

The only non-obvious bit to be aware of here is that when using the FB_QUIET fallback mode, Encode will remove any successfully processed data from the input buffer. The entire script revolves around this behaviour. After the first decode, $_ will be empty if it was successfully decoded. If not, the successfully decoded part at the start of $_ will be returned, and $_ will be truncated from the front up to the offending byte. The second decode is then free to process that. The inner loop will keep running as long as any undecoded input is left, decoding it, if need be, one byte at a time as ISO-8859-1.
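
To see this buffer-consuming behaviour in isolation, here is a tiny sketch (with made-up sample bytes) of what a single FB_QUIET decode does:

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( decode FB_QUIET );

# "Grüße" with the ü as a raw Latin-1 byte (0xFC) but the ß as proper UTF-8
my $buf = "Gr\xFC\xC3\x9Fe";

my $out = decode( "utf-8", $buf, FB_QUIET );
# $out is now "Gr", and $buf has been truncated from the front to
# "\xFC\xC3\x9Fe": it begins with the offending byte, ready to be
# decoded separately as ISO-8859-1.
print "decoded so far: $out; ", length( $buf ), " bytes left\n";

Since the script reads via <> and writes to standard output, it works as a plain filter, e.g. repair-utf8 broken.xml > fixed.xml, or with data piped in on standard input.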

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • s/(?<![\xC2\xC3])([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg; # If chars > 0xFF extend appropriately.
    • Yeah, that’s an option. Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good. I’m also unsure that Latin-1 codepoints correspond 1:1 to Unicode codepoints. But yeah, I get your point.

      It would just take a lot of concentrated effort to ensure 100% correctness when taking that route, and I couldn’t be bothered with the bitfiddle this time (unlike that other time when I wrote codepoint-to-UTF-8 math in XPath within XSLT [plasmasturm.org] of all things).

      • “Not sure your regex is reliable enough; I think some invalid sequences can slip through that, which is no good.”

        My regex replaces every high-bit-set byte that cannot be the trailing byte of a validly UTF-8-encoded Latin-1 character with its two-byte UTF-8 encoding. What is invalid and what is not depends on your definition: is this

        0xC3 0xBF

        meant to be interpreted as a valid UTF-8 encoding of the one character U+00FF

        LATIN SMALL LETTER Y WITH DIAERESIS

        or as two characters

        LATIN CAPITAL LETTER A WITH TILDE (U+00C3 == 0xC3 in Latin-1) followed by INVERTED QUESTION MARK (U+00BF == 0xBF)?
        • If you might also have code points beyond U+00FF in your data, then I fully recommend the Encode way; the regex grows too cumbersome, or at least too ugly. This depends on whether your users have figured out how to input those fancy characters :-)

          Ah, so that’s the assumption in your regex that I was vaguely aware of. Indeed, I cannot ignore codepoints beyond U+00FF. In particular, Unicode curly quotes (U+2019, U+201C, U+201D) and en- and em-dashes (U+2013, U+2014) are ubiquitous (and not at all ha