All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by pudge (1) on 2002.03.10 22:59 (#5692) Homepage Journal
    It'd be nice to do, if it weren't so difficult. I've already somewhat documented the annoyances of trying to get UTF8 support in MP3::Info, and additionally in getting that to play nicely with Apache::MP3 (what if your MP3s are in UTF-8 and your directory names, also printed to the browser, are in Latin-1?). It is not an easy thing to do, and you need to weigh the cost versus the benefit.

    Consider that charsets are difficult to understand for those who don't already understand them, which is a truism, but relevant since most American computer programmers don't need to understand them. Consider that, similarly, most American computer programmers don't have a use for them, so adding support for them not only has no direct benefit, but additionally doesn't scratch that developer's itches. Blah blah blah. Is magical handling of I18N the next killer app?
    • ...what if your MP3s are in UTF-8 and your directory names, also printed to the browser, are in Latin-1...

      Send them both with all characters over 0x80 encoded as &#number; entities. Does that solve the problem?

      • Apache::MP3 still needs to know how to encode the specific characters. Don't some characters over 0x80 differ between Latin-1 and UTF-8?
        • Latin-1 is a subset of Unicode.

          What do you mean by "Apache::MP3 still needs to know how to encode the specific characters."? What encoding to declare the HTML as being in? It doesn't matter, if everything outside of 00-7F is turned into &#number; (or %xx in a URL -- which you do to the bytes, not the characters, incidentally).
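A minimal sketch of the suggestion above, in Python rather than Perl and not taken from Apache::MP3 itself. The helper name `to_entities` and the sample name "naïve" are made up for illustration. It assumes you already know which charset each source uses, which is the part the parent comment is asking about: the bytes have to be decoded to characters before they can be turned into `&#number;` references.

```python
from urllib.parse import quote


def to_entities(text: str) -> str:
    """Replace every character outside 00-7F with a decimal &#number; reference.

    Hypothetical helper, written only for this example.
    """
    return "".join(ch if ord(ch) < 0x80 else f"&#{ord(ch)};" for ch in text)


# The same name as it might arrive from two differently encoded sources.
latin1_bytes = b"na\xefve"      # "naive" with i-uml, stored as Latin-1
utf8_bytes = b"na\xc3\xafve"    # the same name, stored as UTF-8

# Decoding is the step that needs to know the source charset.
from_latin1 = latin1_bytes.decode("latin-1")
from_utf8 = utf8_bytes.decode("utf-8")

# Once decoded, both produce the same ASCII-only HTML text.
html_a = to_entities(from_latin1)
html_b = to_entities(from_utf8)

# In a URL, %xx escaping is applied to the bytes, so the two sources differ.
url_latin1 = quote(latin1_bytes)
url_utf8 = quote(utf8_bytes)

print(html_a, html_b, url_latin1, url_utf8)
```

With entities, the charset declared for the HTML page stops mattering, because the output is plain ASCII. The %xx form is different: it still carries whichever byte encoding the name was stored in.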

          • If Latin-1 is a subset of Unicode, then why do Latin-1 characters get munged when read as part of a UTF-8 document? I changed one letter of a directory to be ï (i with an umlaut) in Latin-1, and when read as UTF-8, it was messed up. When read as Latin-1, it was fine. In Latin-1, it has a value of decimal 239. Does it have the same value in UTF-8? If so, then what good would it be to print &#239;, since it's already known to be byte 239 ... wouldn't it still need to be specially encoded somehow so
            • OK, I think you're confusing the encoding and the content. Code point 239 is i-uml in both Latin-1 and Unicode. That's the content.

              However, you need to pick one of three encodings: as UTF8, as raw, or as an entity reference.

              • If you express 239 as UTF8, it's bytes 0xC3 0xAF, and you should express that this document is encoded as UTF8.
              • If you express 239 as raw (an encoding which works only for characters up to 0xFF) it's the single byte 0xEF.
              • If you encode it as entity reference, it's "&#239;",
              • I didn't confuse encoding and content, per se; I merely thought &#239; would, in UTF-8, stand for the byte 239, not character 239. Hum! OK, I'll play around a bit, thanks.
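
For anyone who wants to play around with it, here is a small Python sketch of the three encodings of code point 239 described above. It also shows why the single Latin-1 byte gets munged when read as UTF-8. The variable names are made up for illustration, and "raw" is taken here to mean Latin-1.

```python
ch = chr(239)  # code point 239 (U+00EF), i-uml

as_utf8 = ch.encode("utf-8")     # two bytes: 0xC3 0xAF
as_raw = ch.encode("latin-1")    # one byte: 0xEF
as_entity = f"&#{ord(ch)};"      # ASCII-only entity reference

# Reading the raw Latin-1 byte as UTF-8 fails: 0xEF on its own is the
# start of a multi-byte sequence with nothing valid following it.
try:
    as_raw.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False

# A lenient decoder substitutes U+FFFD, which is the "messed up" character.
munged = as_raw.decode("utf-8", errors="replace")

print(as_utf8, as_raw, as_entity, misread_ok, repr(munged))
```

The entity form is the only one of the three that reads the same whichever charset the page declares, since the reference names the character number and not a byte.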