Journal of TorgoX (1933)

Sunday March 10, 2002
09:53 PM


[ #3420 ]
« The crap that Japanese people put up with in their software just because it has a smidgen of Japanese support is intolerable; it shouldn't be that 99% of American programmers have no idea what an umlaut is, or what the differences between Japanese and Chinese are. »
-- Ben's journal
  • It'd be nice to do, if it weren't so difficult. I've already somewhat documented the annoyances of trying to get UTF8 support in MP3::Info, and additionally in getting that to play nicely with Apache::MP3 (what if your MP3s are in UTF-8 and your directory names, also printed to the browser, are in Latin-1?). It is not an easy thing to do, and you need to weigh the cost versus the benefit.

    Consider that charsets are difficult to understand for those who don't already understand them, which is a truism, but r
    • ...what if your MP3s are in UTF-8 and your directory names, also printed to the browser, are in Latin-1...

      Send them both with all characters over 0x80 encoded as &#number; entities. Does that solve the problem?

      • Apache::MP3 still needs to know how to encode the specific characters. Don't some characters over 0x80 differ between Latin-1 and UTF-8?
        • Latin-1 is a subset of Unicode.

          What do you mean by "Apache::MP3 still needs to know how to encode the specific characters."? What encoding to declare the HTML as being in? It doesn't matter, if everything outside of 00-7F is turned into &#number; (or %xx in a URL -- which you do to the bytes, not the characters, incidentally).
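          A minimal sketch of that suggestion, in Python rather than Perl, assuming each name has already been decoded from whatever charset it was stored in. The names `to_ascii_html`, `dir_name`, and `title` are made up for illustration and are not part of Apache::MP3 or MP3::Info.

          ```python
          from urllib.parse import quote


          def to_ascii_html(text: str) -> str:
              """Replace every character outside 00-7F with a decimal &#number; reference.

              Note: this does not escape &, < or >; a real page would need that too.
              """
              return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


          # Hypothetical inputs: a directory name stored as Latin-1 bytes and a
          # track title stored as UTF-8 bytes. Both contain the same character.
          dir_name = b"na\xefve".decode("latin-1")
          title = b"na\xc3\xafve".decode("utf-8")

          # Once decoded, both produce the same pure-ASCII HTML, so the charset
          # declared for the page no longer matters for these strings.
          html_dir = to_ascii_html(dir_name)
          html_title = to_ascii_html(title)

          # %xx escaping in a URL is applied to bytes, so the result depends on
          # which byte encoding is chosen first.
          url_utf8 = quote(dir_name.encode("utf-8"))
          url_latin1 = quote(dir_name.encode("latin-1"))

          print(html_dir, html_title, url_utf8, url_latin1)
          ```

          The decoding step is the part that still requires knowing the source charset of each string; the entity trick only removes the need to agree on an output charset.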

          • If Latin-1 is a subset of Unicode, then why do Latin-1 characters get munged when read as part of a UTF-8 document? I changed one letter of a directory to be ï (i with an umlaut) in Latin-1, and when read as UTF-8, it was messed up. When read as Latin-1, it was fine. In Latin-1, it has a value of decimal 239. Does it have the same value in UTF-8? If so, then what good would it be to print &#239;, since it's already known to be byte 239 ... wouldn't it still need to be specially encoded somehow so
            • OK, I think you're confusing the encoding and the content. Character point 239 is i-uml in both Latin-1 and Unicode. That's the content.

              However, you need to pick one of three encodings: as UTF8, as raw, or as an entity reference.

              • If you express 239 as UTF8, it's bytes 0xC3 0xAF, and you should express that this document is encoded as UTF8.
              • If you express 239 as raw (an encoding which works only for characters up to 0xFF) it's the single byte 0xEF.
              • If you encode it as an entity reference, it's "&#239;".
              • I didn't confuse encoding and content, per se; I merely thought &#239; would, in UTF-8, stand for the byte 239, not character 239. Hum! OK, I'll play around a bit, thanks.
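              For anyone wanting to play around with the same thing, here is a small Python sketch (not Perl, and not code from either poster) of the three encodings of character 239 discussed above, plus what happens when the bytes are read back under the wrong assumption. The variable names are illustrative only.

              ```python
              ch = chr(239)  # the character i-with-diaeresis, code point 239 (0xEF)

              # Three ways to put that one character on the wire.
              as_utf8 = ch.encode("utf-8")      # two bytes: 0xC3 0xAF
              as_raw = ch.encode("latin-1")     # one byte: 0xEF
              as_entity = ch.encode("ascii", "xmlcharrefreplace")  # ASCII text: &#239;

              # Reading the raw Latin-1 byte as UTF-8 fails: a lone 0xEF is an
              # incomplete multi-byte sequence, so it decodes to U+FFFD.
              raw_read_as_utf8 = as_raw.decode("utf-8", errors="replace")

              # Reading the UTF-8 bytes as Latin-1 "works" but yields two wrong
              # characters, one for each byte.
              utf8_read_as_latin1 = as_utf8.decode("latin-1")

              print(as_utf8, as_raw, as_entity)
              print(repr(raw_read_as_utf8), repr(utf8_read_as_latin1))
              ```

              The entity form is the only one of the three that reads the same whether the page is declared as Latin-1 or UTF-8, since it consists entirely of ASCII bytes.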
  • I do a lot of my journal reading in a library at university. Today I was doing it in the 2nd floor lab - just by the P-PS range of books (as in, I go out the door of the lab and am faced with P200-220).

    So I looked that book up, found it was P211, spotted it from the lab =) I'll borrow it when I leave.
      ---ict / Spoon