Stories, comments, journals, and other submissions on use Perl; are Copyright 1998-2006, their respective owners.
Preserving Entities (Score:2)
Will bite you in the end. Trust me. I've just come from a project where we used hack upon hack upon hack upon hack to ensure that entities were preserved in one state or another. The trouble is that you effectively end up with several layers of character encoding. In our case, we ended up with stuff in the database which contained &amp; et al. Well, in some tables we did. In others we had UTF-8. And the search engine saw character references and turned them into latin-1. Sometimes. So you never really knew what you were going to get back from the database, or what kind of escaping and transcoding it required. I have one particular function which first attempts to encode a string using UTF-8 and, if that fails, falls back to latin-1.
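For anyone who hasn't met that kind of fallback, here is a rough sketch of the idea. It is in Python purely for illustration, the name decode_with_fallback is made up, and it assumes the function is handed raw bytes of unknown encoding and has to turn them into characters. It is not the code from the project.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Hypothetical helper: try UTF-8 first, then fall back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this branch never
        # raises, but it can silently produce the wrong characters.
        return raw.decode("latin-1")


utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'

# Both inputs come out as the same text, which is the whole point,
# and also why the guess can hide real data problems.
print(decode_with_fallback(utf8_bytes) == decode_with_fallback(latin1_bytes))
```

The catch is that the fallback is a guess: bytes that happen to be valid UTF-8 but were really written as latin-1 will be decoded the wrong way without any error.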
Needless to say, I am horrified by all this. And pretty much all of it could have been prevented by working on the XML infoset instead of getting involved with XML's lexical details. That way you ensure that your input data has only a single, known encoding. You then know that a single set of transformations is enough to get it understood by a browser.
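To make the infoset point concrete, here is a small sketch, again in Python for illustration only. It assumes the standard library's xml.etree.ElementTree parser, and the helper name text_of is hypothetical. Once the document is parsed, a character reference and the literal UTF-8 character are indistinguishable.

```python
import xml.etree.ElementTree as ET


def text_of(xml_bytes: bytes) -> str:
    """Hypothetical helper: parse a tiny document and return the root's text."""
    return ET.fromstring(xml_bytes).text


# Same content, two lexical forms: a numeric character reference
# versus the literal character encoded as UTF-8.
with_reference = text_of(b"<p>caf&#233; &amp; bar</p>")
with_literal = text_of("<p>café &amp; bar</p>".encode("utf-8"))

# After parsing there is only one representation to deal with.
print(with_reference == with_literal)
```

Everything downstream of the parser then handles plain Unicode strings, and escaping happens exactly once, on output.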
After 4 years of delivering websites where we attempted to turn Unicode into something simpler for the thick-as-pig-shit browsers, we've gradually come to the conclusion that it's better to spit out UTF-8 regardless. If the user gets funny about characters, tell them to get a better font. Some of the Microsoft core fonts [sourceforge.net] are surprisingly good. They work really well in Firefox. I just wish that the Bitstream Vera ones had such a large character repertoire.
Although you probably know most of it, this tutorial [skew.org] about Unicode and XML is worth a quick read.
Anyway, congratulations on finding that nasty bug!
-Dom
Re:Preserving Entities (Score:2)
We own the data that we are serving up through this web app. So it's fully normalized by the time it's parsed in this pipeline. The problem is more about keeping the entities that are in there from being converted into UTF-8.
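One possible way to get a similar effect, sketched in Python for illustration, is to let the parser convert everything to characters and then re-escape on the way out. This assumes numeric character references are an acceptable substitute for the original named entities, which may not hold for this app. The function name to_ascii_markup is hypothetical.

```python
from xml.sax.saxutils import escape


def to_ascii_markup(text: str) -> bytes:
    """Hypothetical helper: escape markup characters, then emit pure ASCII.

    Any character outside ASCII is written as a numeric character
    reference such as &#233; instead of as UTF-8 bytes.
    """
    return escape(text).encode("ascii", "xmlcharrefreplace")


print(to_ascii_markup("café & bar"))
```

This does not preserve the named entities themselves, only guarantees that no non-ASCII bytes reach the output.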
Re:Preserving Entities (Score:2)
I'm still curious about the need for character references rather than UTF-8 bytes though. Which browsers were giving you trouble?
-Dom