
All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • The encodings handling needs some cleaning up. Fortunately it doesn’t appear to be broken so badly as to be hard to fix.

    My “Expressiveness matters” [perl.org] post gets its curly quotes encoded as &#147; and &#148;, respectively, which are undefined in the ISO-8859-1 charset the pages claim to be encoded in. They are only defined in Windows Codepage 1252. It still works because browsers have generally given up and just treat the two as equal (which is doable because Win1252 is a true superset of Latin1), but correct it ain’t.

    But the same numeric entities are used in the RSS feed. Not only are you not forgiven for claiming to be Latin1 when you are Win1252 there, though, but numeric entities in XML always refer to Unicode codepoints [w3.org]. So the curly quotes must be encoded as &#8220; and &#8221;, respectively. As a result, all XML consumers show my post with “no such character” boxes around the title.

    The easiest thing to do (in terms of ensuring correctness, not necessarily in terms of implementation on a site with a huge amount of legacy content, though that might require nothing more than a dump/transcode/restore cycle of the database) would be to just switch to UTF-8 wholesale. Then you can forget about numeric entities entirely and just encode the five requisite characters (amp, lt, gt, apos, quot) with their named entities. (Or use &#39; instead of apos, since that named entity is defined only in XML, not in HTML.)

    • Not only are you not forgiven for claiming to be Latin1 when you are Win1252 there

      And you are not forgiven for *using* Win1252 in the first place. I am not sure it is correct for me to try to fix your mistake and guess at what character you intended. How can I know you meant those to be curly quotes, and not something else? Sure, those are undefined in Latin-1, but how do I know what charset you are using, if you're not using Latin-1?
      • Ugh! You are correct. The problem is precisely the aforementioned fact that browsers treat Latin1 as Win1252: the form is Latin1, so when I paste curly quotes, my browser throws its arms up and sends Win1252, instead of telling me. Gahhhh.

        Can we please have UTF-8 as soon as manageably possible? :-(

        • In Slash right now, we have special casing for high-bit chars, for sites that want plain ASCII. What I can probably do is add to that, for sites like useperl that are more open, special-casing those few chars from 128-159. It should catch most cases, like this one. It sucks, but ... so does the web. :-)

          As to UTF, we tried it once and it messed us up in various ways, largely due to browser support, so I am not eager to try again any time soon. I think this is the best way for now, converting everything
        • I implemented the special-casing for those few non-Latin-1 chars that browsers like to send. Your journal entry title now has the proper encoding.
          • It had it before the fix as well; after our exchange, I went and fixed the entities manually. If you want, I can try changing the entities back to see what happens.
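
For anyone following along, the special-casing discussed in this thread can be sketched roughly as below. This is not Slash's actual code (Slash is written in Perl, and its implementation is not shown here); it is a minimal Python illustration, and the function name `fix_cp1252_refs` is made up for the example. It assumes that any decimal numeric entity in the 128-159 range was meant as a Windows-1252 character, which is exactly the guess questioned earlier in the thread, and it rewrites such an entity to the Unicode codepoint that Windows-1252 assigns to that byte. Hexadecimal entities are not handled.

```python
import re

# Matches decimal numeric character references such as "&#147;".
_NUMERIC_REF = re.compile(r"&#(\d+);")


def fix_cp1252_refs(text):
    """Rewrite decimal references in 128-159 to the Unicode codepoint
    that Windows-1252 assigns to that byte value (hypothetical helper)."""

    def _remap(match):
        value = int(match.group(1))
        if 128 <= value <= 159:
            try:
                char = bytes([value]).decode("cp1252")
            except UnicodeDecodeError:
                # A few byte values in this range are unassigned in
                # Windows-1252 (129, for example); leave them untouched.
                return match.group(0)
            return "&#%d;" % ord(char)
        # Everything outside 128-159 already names a Unicode codepoint.
        return match.group(0)

    return _NUMERIC_REF.sub(_remap, text)


if __name__ == "__main__":
    print(fix_cp1252_refs("&#147;Expressiveness matters&#148;"))
```

Under that assumption the title's quotes come out as &#8220; and &#8221;, which is what an XML consumer expects. Entities outside 128-159 pass through unchanged, so Latin-1 text such as &#233; is not affected.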