All the Perl that's Practical to Extract and Report

gav (2710)

Hacker in NYC.

Journal of gav (2710)

Friday August 27, 2004
12:03 PM

Encoding woes

[ #20601 ]

Somehow I ended up with a string containing &#147;Foo&#148; in a database (these are windows-1252 smart quotes). This then ended up in an XML file which had a declaration of <?xml version="1.0" encoding="UTF-8"?> but was being served with an HTTP Content-Type header of "text/xml; charset=iso-8859-1" due to a misunderstanding with CGI::Simple.
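
To make the mismatch concrete, here is a minimal Python sketch (not code from the original setup) that reads the same two quote characters under each of the three encodings involved. It assumes the database held the raw windows-1252 bytes 0x93 and 0x94; the variable names are made up for illustration.

```python
# Assumption: the stored string was the raw bytes 0x93 "Foo" 0x94.
raw = b"\x93Foo\x94"

# Read as windows-1252, the bytes are the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode("windows-1252")

# Read as iso-8859-1 (what the HTTP header claimed), the same bytes become
# the C1 control characters U+0093 and U+0094, which have no glyph.
as_latin1 = raw.decode("iso-8859-1")

# Read as UTF-8 (what the XML declaration claimed), the bytes are not even
# a valid sequence, because 0x93 cannot start a UTF-8 character.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print([hex(ord(c)) for c in as_cp1252])  # first and last are 0x201c, 0x201d
print([hex(ord(c)) for c in as_latin1])  # first and last are 0x93, 0x94
print(valid_utf8)                        # False
```

So the same two bytes are smart quotes, invisible control characters, or a decoding error, depending on which of the three labels a consumer decides to believe.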

Strangely enough, it seemed to work in both Firefox and Internet Explorer. Firefox showed the smart quotes, but IE showed the empty squares that denote some kind of bad character. The real issue appeared after saving the XML to a file and re-opening it. IE now chose to point out that the broken XML was actually broken, but Firefox still seemed happy. Firefox was saving the file without the declaration and turning the broken characters into &#8220; and &#8221;. IE chose to decode the characters from windows-1252 and save them, which clashed with the UTF-8 declaration and caused an error.
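
The two saved files can be approximated in a few lines of Python. This is only a sketch under assumptions: the `<doc>` element is invented, the "Firefox-like" file is modelled as numeric character references with no declaration, and the "IE-like" file is modelled as raw windows-1252 bytes left under a UTF-8 declaration. Neither browser's actual save logic is reproduced here.

```python
import xml.etree.ElementTree as ET

DECL = b'<?xml version="1.0" encoding="UTF-8"?>'

# Firefox-like save: decode the bytes as windows-1252, then write the
# non-ASCII characters out as decimal character references.
quotes = b"\x93Foo\x94".decode("windows-1252")
as_refs = quotes.encode("ascii", "xmlcharrefreplace")  # b'&#8220;Foo&#8221;'
firefox_like = b"<doc>" + as_refs + b"</doc>"

# IE-like save (assumed): raw windows-1252 bytes under a UTF-8 declaration.
ie_like = DECL + b"<doc>\x93Foo\x94</doc>"


def parses(data):
    """Return True if a strict XML parser accepts the bytes."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(parses(firefox_like))  # True
print(parses(ie_like))       # False
```

The reference-based file is pure ASCII, so it survives whatever encoding label it ends up with, which would explain why Firefox kept looking happy while the file with raw bytes failed on re-open.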

Using some code like Jacques Distler's StripControlChars MT Plugin, I fixed up the characters to UTF-8, fixed the header, and everybody was happy.
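
The kind of clean-up described above might look roughly like the following Python sketch. It is not the StripControlChars plugin's code and not the Perl actually used; `repair` and its helpers are hypothetical names. It assumes the bad data shows up either as raw characters in the range U+0080 to U+009F or as references such as &#147;, and it remaps both to the character windows-1252 assigns to that byte value.

```python
import re

_C1_CHARS = re.compile(r"[\x80-\x9f]")
_NUMERIC_REF = re.compile(r"&#(\d+);")


def _from_cp1252(code):
    """Return the windows-1252 character for a byte value, or None if unassigned."""
    try:
        return bytes([code]).decode("cp1252")
    except UnicodeDecodeError:
        return None  # 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no mapping


def repair(text):
    """Remap windows-1252 leftovers (raw or as &#128;..&#159;) to real Unicode."""
    def fix_ref(match):
        number = int(match.group(1))
        if 128 <= number <= 159:
            return _from_cp1252(number) or match.group(0)
        return match.group(0)  # leave every other reference untouched

    def fix_char(match):
        return _from_cp1252(ord(match.group(0))) or match.group(0)

    return _C1_CHARS.sub(fix_char, _NUMERIC_REF.sub(fix_ref, text))


fixed = repair("&#147;Foo&#148;")
body = fixed.encode("utf-8")               # b'\xe2\x80\x9cFoo\xe2\x80\x9d'
content_type = "text/xml; charset=utf-8"   # header now agrees with the declaration
```

Once the characters are real Unicode code points, encoding the body as UTF-8 and sending a matching charset means the declaration, the header, and the bytes all tell the same story.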

It seems that even though Firefox is trying to do the right thing, it's broken. The whole problem was caused by a bunch of separate broken things all trying their best to work.

  • Mark Pilgrim wrote an essay about getting the character set correct for XML over HTTP. Unfortunately, even though XML makes dealing with character sets a bit more explicit, it's still got enough areas of pain to be a bother. Particularly when you find out things like this: all characters in an XML document are represented by a Unicode code point regardless of the source input encoding, except that some code points are specifically barred, including U+0080 to U+009F, which is what you're looking at. Gah.