
miyagawa (1653)

http://bulknews.vox.com/

Journal of miyagawa (1653)

Friday September 08, 2006
12:49 AM

Perl UTF-8 and latin-1 woes

[ #30923 ]

I used to think I fully understood Perl's UTF-8 flag and Unicode handling, having dealt with I18N and L10N issues in Perl professionally for more than 5 years.

But it turns out that I still have something to learn, or at least there are things I've only learned recently.

So here's the code.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use File::Temp qw(tempfile);
 
use XML::RSS;
use XML::RSS::LibXML;
use XML::Atom::Feed;
use Test::More 'no_plan';
 
$XML::Atom::ForceUnicode = 1;
$XML::Atom::DefaultVersion = "1.0";
 
my %data;
$data{latin1}  = "Diction" . chr(225) . "rios";
$data{utf8}    = "Diction" . "\xc3\xa1" . "rios";
$data{unicode} = decode_utf8($data{utf8});
 
my %code = (
    'XML::RSS' => \&test_xml_rss,
    'XML::RSS::LibXML' => \&test_xml_rss_libxml,
    'XML::Atom' => \&test_xml_atom,
);
 
for my $module (qw(XML::RSS XML::RSS::LibXML XML::Atom)) {
    for my $label (qw(latin1 utf8 unicode)) {
        $code{$module}->($data{$label}, $label);
    }
}
 
sub is_same {
    my($str1, $str2) = map _unicode($_), @_[0..1];
    is $str1, $str2, pop(@_);
}
 
sub _unicode {
    my $str = shift;
    return $str if utf8::is_utf8($str);
    return Encode::decode_utf8($str) if $str =~ /\xc3/;
    return Encode::decode('latin-1', $str);
}
 
sub test_xml_rss {
    my($string, $label) = @_;
 
    my $rss = XML::RSS->new;
    $rss->channel(title => $string);
 
    my $xml = $rss->as_string;
    diag "XML::RSS + $label: is_utf8() = ",  utf8::is_utf8($xml) ? 1 : 0;
 
    $rss = XML::RSS->new;
    eval {
        my $tmp = write_file($xml);
        $rss->parsefile($tmp);
        is_same $rss->channel->{title}, $string, "XML::RSS $label";
    };
    fail "XML::RSS $label" if $@;
}
 
sub test_xml_rss_libxml {
    my($string, $label) = @_;
 
    my $rss = XML::RSS::LibXML->new;
    $rss->channel(title => $string);
 
    my $xml = $rss->as_string;
    diag "XML::RSS::LibXML + $label: is_utf8() = ",  utf8::is_utf8($xml) ? 1 : 0;
 
    $rss = XML::RSS::LibXML->new;
    eval {
        my $tmp = write_file($xml);
        $rss->parsefile($tmp);
        is_same $rss->channel->{title}, $string, "XML::RSS::LibXML $label";
    };
    fail "XML::RSS::LibXML $label" if $@;
}
 
sub test_xml_atom {
    my($string, $label) = @_;
 
    my $feed = XML::Atom::Feed->new;
    $feed->title($string);
 
    my $xml = $feed->as_xml;
    diag "XML::Atom + $label: is_utf8() = ",  utf8::is_utf8($xml) ? 1 : 0;
 
    eval {
        my $tmp = write_file($xml);
        $feed = XML::Atom::Feed->new($tmp);
        is_same $feed->title, $string, "XML::Atom $label";
    };
    fail "XML::Atom $label" if $@;
}
 
sub write_file {
    my $data = shift;
    # UNLINK is the tempfile() option; CLEANUP only applies to tempdir()
    my($fh, $name) = tempfile(UNLINK => 1);
    print $fh $data;
    close $fh;
    return $name;
}

8 out of 9 tests will fail. That's because the output methods of XML::RSS::LibXML and XML::Atom (as_string() and as_xml() respectively) return a UTF-8 flagged string, regardless of what the input data was. So, if and only if the string consists entirely of characters with code points of 255 or below (the latin-1 range), perl prints them as latin-1 bytes, unless we specify the encoding explicitly.

It doesn't happen if the string contains any character above 255; in that case perl writes UTF-8 (and emits a "Wide character in print" warning if warnings are enabled). That's quite annoying in terms of consistency.
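To see the inconsistency in isolation, here is a minimal sketch that prints UTF-8 flagged strings to a temporary file and looks at the raw bytes that end up on disk. The bytes_written() helper is a hypothetical name made up for this illustration, not part of any module, and the sketch assumes no default I/O layers are in effect (no "use open" pragma, no PERL_UNICODE setting).

```perl
use strict;
use warnings;
use Encode;
use File::Temp qw(tempfile);

# Hypothetical helper (not from any module): print $string to a temp file,
# optionally through an I/O layer, and return the raw bytes written.
sub bytes_written {
    my($string, $layer) = @_;
    my($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh, $layer if defined $layer;
    {
        no warnings 'utf8';    # silence "Wide character in print"
        print $fh $string;
    }
    close $fh;

    open my $in, '<:raw', $name or die "Can't read $name: $!";
    local $/;                  # slurp mode
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Both strings are UTF-8 flagged character strings.
my $narrow = decode_utf8("Diction\xc3\xa1rios");   # all characters <= 255
my $wide   = $narrow . "\x{3042}";                 # one character > 255

for my $case ([narrow => $narrow], [wide => $wide]) {
    my($label, $string) = @$case;
    printf "%-6s no layer: %s\n", $label, unpack("H*", bytes_written($string));
    printf "%-6s :utf8:    %s\n", $label, unpack("H*", bytes_written($string, ":utf8"));
}
```

Without a layer, the "narrow" string comes out with a single e1 byte for the accented character, while the "wide" string comes out with c3a1 for the same character. With the :utf8 layer both come out as c3a1.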

Obviously, the fix is to add:

binmode $fh, ":utf8";

before actually printing the XML to the file, or to use Encode::encode_utf8 or an equivalent. What makes things a bit worse is that neither XML::RSS::LibXML nor XML::Atom documents whether the output data is UTF-8 flagged. Even worse, XML::RSS output may or may not be UTF-8 flagged, depending on the input.
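As a rough sketch, both fixes can be applied to the write_file() helper from the test script. The first variant sets the layer on the filehandle; the second encodes explicitly and writes raw bytes. The names write_file_layer(), write_file_bytes() and slurp_raw() are made up for illustration, and both variants assume $data is a character (Unicode) string.

```perl
use strict;
use warnings;
use Encode;
use File::Temp qw(tempfile);

# Variant 1 (hypothetical name): let the filehandle encode on output.
sub write_file_layer {
    my $data = shift;
    my($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh, ":utf8";
    print $fh $data;
    close $fh;
    return $name;
}

# Variant 2 (hypothetical name): encode explicitly, then write raw bytes.
sub write_file_bytes {
    my $data = shift;
    my($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;                          # raw bytes, no layer
    print $fh Encode::encode_utf8($data);
    close $fh;
    return $name;
}

# Read a file back as raw bytes so the two variants can be compared.
sub slurp_raw {
    my $name = shift;
    open my $in, '<:raw', $name or die "Can't read $name: $!";
    local $/;                             # slurp mode
    return scalar <$in>;
}

my $unicode = decode_utf8("Diction\xc3\xa1rios");
my $same = slurp_raw(write_file_layer($unicode)) eq slurp_raw(write_file_bytes($unicode));
print "both variants wrote identical bytes: ", ($same ? "yes" : "no"), "\n";
```

Neither variant is safe when the data is already a UTF-8 byte string, which is what XML::RSS may hand back depending on the input: in that case both would encode it a second time.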

As the author of XML::Atom, I was about to change the as_xml() implementation to force UTF-8 binary output rather than a UTF-8 flagged string. But I hesitate to push the code now, since it *might* break backward compatibility. There could be code out there that expects $feed->as_xml to return a Unicode string and opens the filehandle in utf8 mode. With that change, those users would get a double UTF-8 encoded string.
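The double encoding concern can be reproduced without XML::Atom at all. The sketch below prints a character string and its UTF-8 encoded bytes through the same ":utf8" filehandle, using an in-memory file. print_via_utf8_layer() is a hypothetical helper, and the two strings only stand in for what as_xml() returns today versus what it would return after the change.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: print $string to an in-memory filehandle opened
# with the :utf8 layer and return the bytes that were produced.
sub print_via_utf8_layer {
    my $string = shift;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die "Can't open in-memory file: $!";
    print $fh $string;
    close $fh;
    return $buffer;
}

# Stand-in for the current behaviour: a Unicode (character) string.
my $characters = decode_utf8("\xc3\xa1");
# Stand-in for the proposed behaviour: UTF-8 encoded bytes.
my $bytes = encode_utf8($characters);

printf "character string via :utf8: %s\n", unpack("H*", print_via_utf8_layer($characters));
printf "byte string via :utf8:      %s\n", unpack("H*", print_via_utf8_layer($bytes));
```

The character string comes out as c3a1, which is correct UTF-8. The byte string comes out as c383c2a1, because each of the two bytes is treated as a latin-1 character and encoded again.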

I chatted about this with Daisuke, the author of XML::RSS::LibXML, and we agreed it's all about documentation, or perhaps about adding another option to force UTF-8 binary output.
