I used to think I understood Perl's UTF-8 flag and Unicode handling thoroughly, having dealt with I18N and L10N issues in Perl professionally for more than 5 years.
But it turns out I still have things to learn, or at least there are things I've only learned recently.
So here's the code.
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use File::Temp qw(tempfile);
use XML::RSS;
use XML::RSS::LibXML;
use XML::Atom::Feed;
use Test::More 'no_plan';
$XML::Atom::ForceUnicode = 1;
$XML::Atom::DefaultVersion = "1.0";
my %data;
$data{latin1} = "Diction" . chr(225) . "rios";
$data{utf8} = "Diction" . "\xc3\xa1" . "rios";
$data{unicode} = decode_utf8($data{utf8});
my %code = (
    'XML::RSS'         => \&test_xml_rss,
    'XML::RSS::LibXML' => \&test_xml_rss_libxml,
    'XML::Atom'        => \&test_xml_atom,
);
for my $module (qw(XML::RSS XML::RSS::LibXML XML::Atom)) {
    for my $label (qw(latin1 utf8 unicode)) {
        $code{$module}->($data{$label}, $label);
    }
}
sub is_same {
    my($str1, $str2) = map _unicode($_), @_[0..1];
    is $str1, $str2, pop(@_);
}

sub _unicode {
    my $str = shift;
    return $str if utf8::is_utf8($str);
    return Encode::decode_utf8($str) if $str =~ /\xc3/;
    return Encode::decode('latin-1', $str);
}
sub test_xml_rss {
    my($string, $label) = @_;
    my $rss = XML::RSS->new;
    $rss->channel(title => $string);
    my $xml = $rss->as_string;
    diag "XML::RSS + $label: is_utf8() = ", utf8::is_utf8($xml) ? 1 : 0;
    $rss = XML::RSS->new;
    eval {
        my $tmp = write_file($xml);
        $rss->parsefile($tmp);
        is_same $rss->channel->{title}, $string, "XML::RSS $label";
    };
    fail "XML::RSS $label" if $@;
}
sub test_xml_rss_libxml {
    my($string, $label) = @_;
    my $rss = XML::RSS::LibXML->new;
    $rss->channel(title => $string);
    my $xml = $rss->as_string;
    diag "XML::RSS::LibXML + $label: is_utf8() = ", utf8::is_utf8($xml) ? 1 : 0;
    $rss = XML::RSS::LibXML->new;
    eval {
        my $tmp = write_file($xml);
        $rss->parsefile($tmp);
        is_same $rss->channel->{title}, $string, "XML::RSS::LibXML $label";
    };
    fail "XML::RSS::LibXML $label" if $@;
}
sub test_xml_atom {
    my($string, $label) = @_;
    my $feed = XML::Atom::Feed->new;
    $feed->title($string);
    my $xml = $feed->as_xml;
    diag "XML::Atom + $label: is_utf8() = ", utf8::is_utf8($xml) ? 1 : 0;
    eval {
        my $tmp = write_file($xml);
        $feed = XML::Atom::Feed->new($tmp);
        is_same $feed->title, $string, "XML::Atom $label";
    };
    fail "XML::Atom $label" if $@;
}
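sub write_file {
    my $data = shift;
    # UNLINK (not CLEANUP, which is a tempdir() option) removes the file at exit
    my($fh, $name) = tempfile(UNLINK => 1);
    # Deliberately no binmode/encoding layer here: this is what the tests exercise
    print $fh $data;
    close $fh;
    return $name;
}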
8 out of 9 tests will fail. That's because the output methods of XML::Atom and XML::RSS::LibXML (as_xml() and as_string() respectively) return a UTF-8 flagged string, regardless of what the input data was. So, if and only if the string consists solely of characters with code points up to 255 (= the latin-1 range), perl will print it as latin-1 unless we supply the encoding explicitly.
It doesn't happen if the string contains Unicode characters above 255 (perl then writes UTF-8 and emits a "Wide character" warning), which is quite annoying in terms of consistency.
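To see the inconsistency in isolation, here is a minimal sketch that prints two UTF-8 flagged strings to a filehandle with no encoding layer and reads the raw bytes back. The helper name bytes_on_disk is made up for this illustration, and the sketch assumes no default I/O layers are in effect (no "use open", no PERL_UNICODE in the environment).

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use File::Temp qw(tempfile);

# Hypothetical helper: print $string to a temp file WITHOUT any encoding
# layer, then read the file back as raw bytes.
sub bytes_on_disk {
    my ($string) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    {
        no warnings 'utf8';    # silence "Wide character in print" for the wide case
        print {$fh} $string;
    }
    close $fh or die "close: $!";

    open my $in, '<:raw', $name or die "open: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return $bytes;
}

# UTF-8 flagged, but every character is <= 255: comes out as latin-1.
my $narrow = decode_utf8("Diction\xc3\xa1rios");
# Same string plus U+263A (> 255): the whole string comes out as UTF-8.
my $wide = $narrow . "\x{263a}";

for my $string ($narrow, $wide) {
    my $hex = join ' ', map { sprintf '%02x', ord } split //, bytes_on_disk($string);
    print "$hex\n";
}
```

In the first line of output the accented character is the single byte e1; in the second it becomes c3 a1, even though it is the very same character in both strings.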
Obviously, the fix is to add:
binmode $fh, ":utf8";
before actually printing the XML to the file, or to use Encode::encode_utf8 or something equivalent. What makes things a bit worse is that neither XML::RSS::LibXML nor XML::Atom documents whether the output data is UTF-8 flagged or not. Even worse, XML::RSS output may or may not be UTF-8 flagged, depending on the input.
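Since the flag can be on or off depending on the module and the input, one way to cope on the caller's side is to normalize to bytes before writing. The following is only a sketch: xml_as_utf8_bytes and write_file_utf8 are hypothetical names, and it assumes that an unflagged string coming out of these modules is already UTF-8 encoded bytes (which would not hold for the latin1 input above).

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use File::Temp qw(tempfile);

# Hypothetical helper: return UTF-8 bytes whether or not $xml is flagged.
# Assumption: an unflagged string is already UTF-8 encoded bytes.
sub xml_as_utf8_bytes {
    my ($xml) = @_;
    return utf8::is_utf8($xml) ? encode_utf8($xml) : $xml;
}

# Variant of write_file() that always hands bytes to a raw filehandle,
# so perl never has to guess between latin-1 and UTF-8.
sub write_file_utf8 {
    my ($xml) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;    # raw bytes, no encoding layer
    print {$fh} xml_as_utf8_bytes($xml);
    close $fh or die "close: $!";
    return $name;
}
```

Checking utf8::is_utf8 like this is a workaround for undocumented output rather than a general recipe; when you know you hold a character string, encoding it unconditionally is simpler.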
As the author of XML::Atom, I was about to change the as_xml() implementation to force UTF-8 binary output rather than a UTF-8 flagged string. But I hesitate to push the code now, since it *might* break backward compatibility. There could be code out there that expects $feed->as_xml to return a Unicode string and opens the filehandle in utf8 mode. With the change, those users would get a double UTF-8 encoded string.
I'm chatting about this with Daisuke, the author of XML::RSS::LibXML, and we agreed that it's all about documentation, or perhaps adding another option to force UTF-8 binary output.
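The double-encoding risk can be reproduced without any XML module at all. In this sketch (print_with_utf8_layer is a made-up helper name), the same text is printed through a ":utf8" filehandle twice: once as a character string, which is what such callers get today, and once as already-encoded bytes, which is what they would get after the change.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);
use File::Temp qw(tempfile);

# Hypothetical helper: print $data through a ":utf8" filehandle and
# return the raw bytes that ended up in the file.
sub print_with_utf8_layer {
    my ($data) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh, ':utf8';
    print {$fh} $data;
    close $fh or die "close: $!";

    open my $in, '<:raw', $name or die "open: $!";
    my $bytes = do { local $/; <$in> };
    close $in;
    return $bytes;
}

my $chars = decode_utf8("Diction\xc3\xa1rios");    # character string
my $bytes = encode_utf8($chars);                   # UTF-8 encoded bytes

# Character string: encoded once, the accented letter is c3 a1.
my $once  = print_with_utf8_layer($chars);
# Bytes: each byte is treated as a latin-1 character and encoded again,
# so c3 a1 turns into c3 83 c2 a1.
my $twice = print_with_utf8_layer($bytes);

printf "%d bytes vs. %d bytes\n", length $once, length $twice;
```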