We figured we could just identify which strings are UTF-16 (the Unicode encoding ID3v2 uses; UTF-8 is not supported until ID3v2.4.0, which most software doesn't support yet) and convert them to UTF-8.
if ($uniconvert && ($encoding eq "\001" || $encoding eq "\002")) { # UTF-16, UTF-16BE
    my $u = Unicode::String::utf16($data);
    $data = $u->utf8;
}
That worked fine, until we realized that Unicode::String was leaving in the byte-order mark (BOM), which we don't want. So we strip it out after the fact:
$data =~ s/^\xEF\xBB\xBF//; # strip BOM
Hopefully, that's the right thing. And it seems to work.
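For comparison, here is a minimal sketch of the same conversion done with the core Encode module instead of Unicode::String. It is not what MP3::Info does; the helper name utf16_to_utf8 and its $has_bom argument are made up for illustration. Encode's "UTF-16" decoder reads the BOM to work out the byte order and consumes it, so nothing needs stripping afterwards; "UTF-16BE" is for data that carries no BOM.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert a UTF-16 byte string to UTF-8 bytes.
# $has_bom true  -> "UTF-16"   (byte order taken from the BOM, which is consumed)
# $has_bom false -> "UTF-16BE" (big-endian, no BOM expected)
sub utf16_to_utf8 {
    my ($bytes, $has_bom) = @_;
    my $chars = decode($has_bom ? 'UTF-16' : 'UTF-16BE', $bytes);
    return encode('UTF-8', $chars);
}

# "Hi" as little-endian UTF-16 with a leading BOM (FF FE).
my $utf8 = utf16_to_utf8("\xFF\xFEH\x00i\x00", 1);
print "BOM still present\n" if $utf8 =~ /^\xEF\xBB\xBF/;
print "$utf8\n";
```

Run as-is, this should print only "Hi", with no BOM warning.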
But then we realized that some tags might be Latin-1 and others might be UTF-8; so what to do? Well, we can convert everything to UTF-8, which will be fine, except that it will break things that want everything to be in Latin-1.
Bah.
I think we're going to make a switch of some kind to tell MP3::Info to convert everything to UTF-8. Bah, again, I say!
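Such a switch might look roughly like the sketch below. The names ($want_utf8, set_utf8_output, tag_text) are hypothetical and are not the actual MP3::Info interface; the encoding bytes are the ones ID3v2.4 defines for text frames. With the switch off, the data comes back untouched, so anything expecting Latin-1 keeps working.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ID3v2.4 text-encoding byte => Encode name (bytes 2 and 3 are new in v2.4).
my %ENCODING_FOR = (
    "\000" => 'ISO-8859-1',
    "\001" => 'UTF-16',      # with BOM
    "\002" => 'UTF-16BE',    # no BOM
    "\003" => 'UTF-8',
);

my $want_utf8 = 0;    # hypothetical module-wide switch, off by default

sub set_utf8_output { $want_utf8 = $_[0] ? 1 : 0 }

# Return tag text as UTF-8 bytes when the switch is on; otherwise hand
# back the bytes exactly as stored, so Latin-1 callers are unaffected.
sub tag_text {
    my ($encoding, $data) = @_;
    return $data unless $want_utf8;
    my $name = $ENCODING_FOR{$encoding}
        or return $data;     # unknown encoding byte: leave the data alone
    return encode('UTF-8', decode($name, $data));
}

my $latin1 = "Caf\xE9";                           # "Cafe" with e-acute, ISO-8859-1
print length(tag_text("\000", $latin1)), "\n";    # 4 bytes, untouched
set_utf8_output(1);
print length(tag_text("\000", $latin1)), "\n";    # 5 bytes: e-acute is now C3 A9
```

Keeping the switch off by default is a guess at the least surprising behavior for existing callers; it is an assumption of this sketch, not a decision stated above.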
The Truth about Unicode (Score:1)
We all want Unicode to work, and there is no question that it is the Right Thing (tm) to do, being open, allowing other cultures to join us and use their own writing scheme and all.
The sad truth is that it is actually a huge pain in the ass to implement for most coders, at least in the US and especially in Europe, and I would be really interested to know whether it really makes things easier for Asian coders.
Plus Unicode is usually being forced upon us by XML, which is never a nice thing when you are already f
Re:The Truth about Unicode (Score:2)
Re:The Truth about Unicode (Score:1)
Yes, XML parsers not only tend to die a swift but painful death when they encounter a Latin-1 (or 2 or more) character, even in a CDATA section, but also, at least XML::Parser converts everything to UTF-8, even if the rest of the environment is entirely Latin-n. This is extremely annoying as it adds an extra level of complexity to all applications, and forces people to care about encodings when really they don't want to.
Re:The Truth about Unicode (Score:3, Interesting)
Unicode's in some ways more of a change for them than for us--while ASCII maps to Unicode (especially the UTF-8 encoding) with no change, the same cannot be said for the Asian languages. For them Unicode's more than just an annoyance, it's something t
UTF8 versus Latin-1 (Score:1)
Re:UTF8 versus Latin-1 (Score:2)
Re:UTF8 versus Latin-1 (Score:1)
Or: if MP3s have an explicit setting that says what encoding something is, then presumably there's no guesswork involved at all.
Re:UTF8 versus Latin-1 (Score:2)
Re:UTF8 versus Latin-1 (Score:1)
But anyway. Ideally, calling applications (like a CGI that passes on the tag data) should make clear what kinds of data-encoding they can or can't cope with.
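One way to picture that: the caller names the encoding it can cope with, and the text gets encoded to that on the way out. A rough sketch, with a made-up function name (text_for_caller); it assumes the tag text has already been decoded into Perl characters.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Hypothetical: $chars is already-decoded text; $accepts is whatever
# encoding name the calling application says it can handle.
sub text_for_caller {
    my ($chars, $accepts) = @_;
    my $enc = find_encoding($accepts)
        or die "Unknown encoding requested: $accepts\n";
    return encode($enc->name, $chars);
}

my $title = "Caf\x{E9}";                                     # e-acute as one character
print length(text_for_caller($title, 'ISO-8859-1')), "\n";   # 4 bytes
print length(text_for_caller($title, 'UTF-8')), "\n";        # 5 bytes
```

Characters the requested encoding cannot represent still need a policy (substitute something, or fail); Encode's optional CHECK argument is where that would be decided, and this sketch just takes the default.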