All the Perl that's Practical to Extract and Report

pudge (1)

http://pudge.net/

I run this joint, see?

Journal of pudge (1)

Friday February 15, 2002
11:58 AM

MP3::Info and Unicode

[ #2889 ]
So Che_Fox wants MP3::Info to handle Unicode strings. Well, he and others had recently helped me fix some problems with MP3::Info on ID3v2 tags and encoding bytes, so sure, let's look at it.

We figured we could just identify which strings are UTF-16 (the default for ID3v2; UTF-8 is not even supported until ID3v2.4.0, which most software doesn't even support yet) and convert them to UTF-8.

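use Unicode::String;    # provides utf16() and the ->utf8 method

# Encoding byte "\001" is UTF-16 with a BOM; "\002" is UTF-16BE without one
if ($uniconvert && ($encoding eq "\001" || $encoding eq "\002")) {
    my $u = Unicode::String::utf16($data);
    $data = $u->utf8;
}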

That worked fine, until we realized that Unicode::String was leaving in the byte-order mark (BOM), and we don't want that. So we strip it out after the fact:

    $data =~ s/^\xEF\xBB\xBF//;    # strip BOM (U+FEFF as UTF-8 bytes)

Hopefully, that's the right thing. And it seems to work.
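
For comparison, here is a minimal sketch of the same conversion using the Encode module (bundled with Perl 5.8 and later) instead of Unicode::String. This is a swapped-in alternative, not what MP3::Info does, and the helper name utf16_to_utf8 is made up for illustration. Encode's 'UTF-16' decoder reads the BOM to pick the byte order and then drops it, so BOM-prefixed strings need no separate stripping step.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert raw ID3v2 text bytes to UTF-8 bytes.
# $encoding is the frame's encoding byte: "\001" = UTF-16 with BOM,
# "\002" = UTF-16BE without BOM.
sub utf16_to_utf8 {
    my ($encoding, $data) = @_;
    my $name  = $encoding eq "\001" ? 'UTF-16' : 'UTF-16BE';
    my $chars = decode($name, $data);    # 'UTF-16' consumes the BOM
    return encode('UTF-8', $chars);
}

# "Hi" as little-endian UTF-16 with a BOM
my $utf8 = utf16_to_utf8("\001", "\xFF\xFEH\x00i\x00");
```

Note that decoding with 'UTF-16' fails if the data has no BOM at all, so real tag data would probably want an eval around the call.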

But then we realized that some tags might be Latin-1 and others might be UTF-8, so what to do? Well, we can convert everything to UTF-8, which will be fine, except that it will break things that want everything to be in Latin-1.

Bah.

I think we're going to make a switch of some kind to tell MP3::Info to convert everything to UTF-8. Bah, again, I say!
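
One way such a switch might look, sketched with the core Encode module. The function name tag_text and the $want_utf8 flag are hypothetical, not MP3::Info's interface, and the encoding-byte table follows the ID3v2.4.0 list. With the flag off, the raw bytes pass through untouched, so Latin-1 consumers keep working; with it on, everything comes back as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ID3v2 encoding bytes mapped to Encode's encoding names.
my %ENCODING_NAME = (
    "\000" => 'ISO-8859-1',
    "\001" => 'UTF-16',      # BOM required
    "\002" => 'UTF-16BE',    # no BOM
    "\003" => 'UTF-8',
);

# Hypothetical: return tag text as UTF-8 bytes when $want_utf8 is true,
# otherwise return the raw bytes unchanged.
sub tag_text {
    my ($encoding, $data, $want_utf8) = @_;
    return $data unless $want_utf8;
    my $name = $ENCODING_NAME{$encoding}
        or return $data;     # unknown encoding byte: leave the data alone
    return encode('UTF-8', decode($name, $data));
}

my $raw  = tag_text("\000", "caf\xE9", 0);    # still Latin-1 bytes
my $utf8 = tag_text("\000", "caf\xE9", 1);    # same text as UTF-8 bytes
```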

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • We all want Unicode to work, and there is no question that it is the Right Thing (tm) to do, being open, allowing other cultures to join us and use their own writing scheme and all.

    The sad truth is that it is actually a huge pain in the ass to implement for most coders, at least in the US and especially in Europe, and I would be really interested to know if it makes things really easier for Asian coders.

    Plus Unicode is usually being forced upon us by XML, which is never a nice thing when you are already f

    • Is it worse in Europe specifically because UTF-8 and 8-bit Latin-1 are incompatible?
      • Yes, XML parsers not only tend to die a swift but painful death when they encounter a Latin-1 (or 2 or more) character, even in a CDATA section, but also, at least XML::Parser converts everything to UTF-8, even if the rest of the environment is entirely Latin-n. This is extremely annoying as it adds an extra level of complexity to all applications, and forces people to care about encodings when really they don't want to.

    • The problem with the Asian languages is most of them already have a perfectly serviceable local standard. Big5 (traditional) and GB2312 (simplified) for Chinese, and Shift-JIS (amongst others) for Japanese. Korean and Vietnamese also have standards that work just fine.

      Unicode's in some ways more of a change for them than for us--while ASCII maps to Unicode (especially the utf8 encoding) with no change, the same can not be said for the asian languages. For them Unicode's more than just an annoyance, it's something t
  • Why not just write things as Latin-1 if they consist only of characters [\x00-\xFF], and UTF8 otherwise?
    • Won't those characters show up wrongly when you expect to see UTF-8 characters, then? I don't really understand. Let's say I have ÿ, \xFF. I assume that character has some other byte representation in UTF-8. But how is that byte represented in UTF-8? Do you understand what it is that I don't understand?
      • I'm assuming all mp3-readers auto-detect encoding, so there's no "expecting to see UTF8" -- if you see UTF8, you see it and decode it as such, otherwise you assume it's something else. Remember, pretty much only UTF8 looks like UTF8.

        Or: if mp3s have an explicit setting that says what encoding something is, then presumably there's no guesswork involved at all.

        • MP3 tags aren't just for MP3 readers, they are for web browsers, databases, text files of various kinds, etc.
          • By "mp3 reader", I mean anything that accesses the tag data in the files, including libraries that just pass it on to other applications.

            But anyway. Ideally, calling applications (like a CGI that passes on the tag data) should make clear what kinds of data-encoding they can or can't cope with.
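
To make the question about ÿ concrete: as a Unicode code point it is U+00FF, which Latin-1 stores as the single byte \xFF and UTF-8 stores as the two bytes \xC3\xBF. A lone \xFF byte is never valid UTF-8, which is why the "only UTF8 looks like UTF8" heuristic is plausible. The sketch below, using the core Encode module, illustrates both points. The names looks_like_utf8 and to_utf8 are hypothetical, and the Latin-1 fallback is an assumption about what non-UTF-8 data contains.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# True if $bytes is a well-formed UTF-8 sequence.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Pass UTF-8 through; treat anything else as Latin-1 and convert it.
sub to_utf8 {
    my ($bytes) = @_;
    return $bytes if looks_like_utf8($bytes);
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

my $latin1 = "\xFF";               # y-diaeresis as one Latin-1 byte
my $utf8   = to_utf8($latin1);     # the same character as \xC3\xBF
```

The heuristic is not airtight: a Latin-1 string that happens to contain the bytes \xC3\xBF ("Ã¿") also passes the check, so an explicit encoding marker, where the tag has one, is the safer signal.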