miyagawa (1653)

Journal of miyagawa (1653)

Saturday October 07, 2006
04:22 PM


[ #31250 ]

So I created a module, HTTP::Response::Charset, which detects the charset of an HTTP response using various techniques (Content-Type, META tag, BOM, XML declaration and Encode::Detect). The motivation is to get a correctly decoded Unicode string from any HTTP response, especially text/html, text/plain, XHTML and RSS/Atom.

The POD document has most of what I'd like to say, so go ahead and take a look at it. Also see the unit test suite, which uses Test::Base, to see the expected behavior.

After I wrote this code I searched Google Code Search for a while and found that HTTP::Message, the base class of HTTP::Response, already has a decoded_content() method, which is exactly what I wanted to implement using the charset() value. Ugh.

Fortunately or unfortunately, the current decoded_content is slightly different from what I wanted to do. It first decodes the content body based on the Content-Encoding header, like gzip, deflate or quoted-printable. Then, if the Content-Type is text/*, it decodes the content using the charset value set in the header (or in the META tag if the response is HTML).

This might not be good enough for some corner cases:

1) If there's no charset= set in either the Content-Type header or a META tag, it decodes the text as latin-1 by default and gives corrupted Unicode data. (You can avoid that by saying $res->decoded_content(default_charset => 'none'), though.)

2) It does Unicode decoding only for text/* responses, which means that if the response is application/xhtml+xml or application/atom+xml, the content is not decoded.

(Note that I'm not saying this is a bug. For XML data you don't need to decode the text portion yourself, since most XML parsers detect the encoding when they process the XML declaration and set the UTF-8 flag internally.)

Update: per Matts, this is a bug. Should I send a patch to Gisle so that it also decodes when Content-Type matches application/(*+)xml?
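To make corner case 1) concrete, here is a minimal sketch using only the core Encode module. The sample string and variable names are made up for illustration. It only simulates what a latin-1 default does to UTF-8 bytes and is not the actual decoded_content code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical response body: "café" sent as UTF-8 bytes, no charset declared.
my $bytes = "caf\xC3\xA9";

# Falling back to latin-1 turns each byte of the UTF-8 sequence
# into a separate character.
my $wrong = decode('ISO-8859-1', $bytes);

# Decoding with the charset the body really uses gives the intended text.
my $right = decode('UTF-8', $bytes);

printf "latin-1: %d chars, UTF-8: %d chars\n", length($wrong), length($right);
# prints "latin-1: 5 chars, UTF-8: 4 chars"
```

The extra character in the latin-1 result is the corruption described above: the two bytes of the encoded "é" are treated as two separate characters.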

So I hope this module fills that gap. For 1), you can pass

$res->decoded_content(charset => $res->charset)

to deal with an HTTP response that has no charset set. For 2), you can just say

Encode::decode($res->charset, $res->content);

for whatever MIME types you'd like to decode.

However, to correctly decode XML data that is gzip-encoded and has a BOM, for instance, you need to write it this way:

my $content = Encode::decode($res->charset, $res->decoded_content(charset => 'none'));

which looks a little kludgy.
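The two steps hidden in that one-liner can be sketched without LWP, using only core modules. This is an illustration under assumptions: the gunzip call stands in for decoded_content(charset => 'none'), the hard-coded 'UTF-16' stands in for whatever $res->charset would return, and the sample document is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Build a hypothetical response body: UTF-16 XML with a BOM, then gzip it.
my $xml   = qq{<?xml version="1.0"?><t>\x{3042}</t>};
my $plain = encode('UTF-16', $xml);    # Encode's 'UTF-16' prepends a big-endian BOM
my $wire;
gzip(\$plain => \$wire) or die "gzip failed: $GzipError";

# Step 1: undo the Content-Encoding only, leaving raw bytes.
my $raw;
gunzip(\$wire => \$raw) or die "gunzip failed: $GunzipError";

# Step 2: decode the raw bytes with the detected charset.
# 'UTF-16' reads the BOM to pick the byte order and strips it.
my $content = decode('UTF-16', $raw);

print $content eq $xml ? "round trip ok\n" : "mismatch\n";
```

Doing the steps in the opposite order cannot work, since the BOM is only visible after the gzip layer has been removed.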

I'm not sure whether it's better to hack (or extend) the decoded_content method, or to add another convenience method that does the right thing.

Any feedback is welcome.

  • (Note that I'm not saying this is a bug. For XML data you don't need to decode the text portion yourself, since most XML parsers detect the encoding when they process the XML declaration and set the UTF-8 flag internally.)

    This is a bug - for XML you're supposed to be able to declare the encoding in the protocol. It's part of the spec (but I'm too lazy to look it up right now).
    • Good to know and it'll be a good rationale for the module. Yeah, reading [] assures me that the charset parameter is important for both text/xml and application/xml, and XML/MIME parsers should respect that.

      • Yeah, it's one of the corners of the XML spec where I'm never quite sure they did the right thing, and I doubt many XML processing systems respect it, but that's what we have to live with :-)
  • The META tag is a nasty hack. Originally, servers were supposed to parse the outgoing document, insert the given headers into the HTTP header, and drop the tag on the floor (yes, really). Instead, clients now parse the body and then retroactively pretend the META tags had been part of the HTTP header. That leads to various kinds of nastiness. The whole thing is ugly and nasty and painful.

    Please only respect it when found in text/html content – in application/xhtml+xml you should never ever look at it.

    • I agree. In application/xhtml+xml we should just look at the BOM or the XML encoding declaration, then. Changing that would be a one-line fix. Thoughts?
      • In short, the rules for XML MIME types (m{application/(.*\+)?xml}) are that if the HTTP header says anything, it is authoritative; otherwise, the XML parser gets to decide from the byte stream. If you are planning to pre-decode XML content as a courtesy for people who may want to do something other than pass it to an XML parser, you should read the XML spec; it has a clear outline of the algorithm an XML parser uses to detect the encoding.

        But if you do that, be aware that XML parsers will want to decode t
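The precedence rule described in this comment can be sketched as a small function. It is a rough illustration under assumptions, not what HTTP::Response::Charset does: the name xml_charset is hypothetical, text/xml is accepted in addition to the application types, and UTF-32, EBCDIC and BOM-less UTF-16 are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: pick a charset name for an XML response.
# $content_type is the raw Content-Type value, $body the undecoded bytes.
sub xml_charset {
    my ($content_type, $body) = @_;

    # Only XML media types are handled here.
    return unless $content_type
        =~ m{^\s*(?:application|text)/(?:[^;\s/]+\+)?xml\s*(?:;|$)}i;

    # 1. A charset parameter in the HTTP header is authoritative.
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)/i) {
        return $1;
    }

    # 2. Otherwise look at the byte stream: BOM first, then the declaration.
    return 'UTF-8'  if $body =~ /^\xEF\xBB\xBF/;
    return 'UTF-16' if $body =~ /^(?:\xFE\xFF|\xFF\xFE)/;
    if ($body =~ /^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/) {
        return $1;
    }

    # 3. With no BOM and no declaration, XML defaults to UTF-8.
    return 'UTF-8';
}

my $decl = '<?xml version="1.0" encoding="EUC-JP"?><feed/>';
print xml_charset('application/atom+xml; charset=Shift_JIS', $decl), "\n";  # Shift_JIS
print xml_charset('application/xhtml+xml', $decl), "\n";                    # EUC-JP
```

In the first call the header wins even though the document declares something else, which is the "authoritative" part of the rule.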

        • To be clear, what this module does is nothing new. It is part of the CPANization of our Plagger code that deals with real-world feeds and HTML content. This code has been in my daily use against thousands of feeds and has been doing pretty well.

          And I removed the META detection if MIME-type is application/xhtml+xml. Actually I could remove the entire META detection code, since it's already done in LWP::UserAgent (and LWP::Protocol) unless you call $ua->parse_head(0) explicitly.

          If you are planning to pre-dec
          • Oh. I was referring to RFC 2616 (HTTP/1.1) Section 3.7.1, which says:

            The “charset” parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the “text” type are defined to have a default charset value of “ISO-8859-1” when received via HTTP. Data in character sets other than “ISO-8859-1” or its subsets MUST be labeled with an appropriate chars

  • XML is not text, it's binary data that looks like text. It needs to live in a binary string, not a text string.

    While this binary data does have text in it, not all of it actually is. The text in the binary data is encoded, and the character set for this has to be given. Indeed, with the charset attribute in the Content-Type, or in the <?xml?> declaration.

    You need to know the right encoding, and when it's not in the document itself, you need to pass it to your XML parser, so it knows how to handle the
    • If you decode() the data, then you're in trouble. The string returned by Perl's decode() is a Perl text string, which is a Unicode string, but not a UTF-8 string, not an ISO-8859-1 string, not a Windows-1252 string, etcetera.

      Sure, I know. I maintain XML::Atom, some Encode modules and do I18N stuff every day with Perl 5.8 server side :)

      And any parser that does know about Perl's text strings wouldn't really be XML compliant, because XML compliancy involves handling character encodings.

      Agreed. As I said on the
    • But your instinct should tell you that modifying the document before parsing it for real is incredibly bad.

      Oh? Why?

      A major point of the preamble as specified is that intermediaries that need to mung the content be able to safely and reliably change the preamble to reflect their modifications. (F.ex., you might want to send XML over a transport that isn’t 8-bit-safe, in which case you can transcode the document to UTF-7 and not have to parse and re-emit it as US-ASCII with entities.)

      Other than

      • As I said, it's about instincts. So defining "why" will be hard ;)

        But I do think it can be summarized, so that I can avoid the explicit why-question: to modify correctly, you need to parse. You can't parse before parsing, it's an infinite loop.
        • An XML parser has to solve the same quandary: before it can parse the document, it has to decode it, but to decode it, it needs to know the encoding, which is specified in the preamble. Oops.

          Well, no. The preamble is actually a highly restricted protocol that you can implement without having to parse more of the document. Read the XML spec – there is a very clear outline of the exact layout of the preamble in terms of the actual possible octet sequences. (It’s impressive how many niggles they
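The octet-level detection mentioned here can be sketched as a lookup on the first few bytes. This follows the general idea of the autodetection appendix in the XML spec but covers only a subset of its cases (the unusual UCS-4 byte orders are left out). The function name and the returned labels are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: guess the encoding family from the leading octets.
sub sniff_xml_family {
    my ($bytes) = @_;
    my @table = (
        # Byte order marks (UCS-4 must be tested before UTF-16).
        [ "\x00\x00\xFE\xFF" => 'UCS-4 BE (BOM)' ],
        [ "\xFF\xFE\x00\x00" => 'UCS-4 LE (BOM)' ],
        [ "\xFE\xFF"         => 'UTF-16 BE (BOM)' ],
        [ "\xFF\xFE"         => 'UTF-16 LE (BOM)' ],
        [ "\xEF\xBB\xBF"     => 'UTF-8 (BOM)' ],
        # No BOM: match the octets of a leading "<" or "<?xm".
        [ "\x00\x00\x00\x3C" => 'UCS-4 BE' ],
        [ "\x3C\x00\x00\x00" => 'UCS-4 LE' ],
        [ "\x00\x3C\x00\x3F" => 'UTF-16 BE' ],
        [ "\x3C\x00\x3F\x00" => 'UTF-16 LE' ],
        [ "\x3C\x3F\x78\x6D" => 'ASCII-compatible, read the declaration' ],
        [ "\x4C\x6F\xA7\x94" => 'EBCDIC, read the declaration' ],
    );
    for my $entry (@table) {
        my ($prefix, $family) = @$entry;
        return $family if index($bytes, $prefix) == 0;
    }
    # Anything else is treated as UTF-8 without a declaration.
    return 'UTF-8 (no declaration)';
}

print sniff_xml_family(encode('UTF-16LE', '<?xml version="1.0"?>')), "\n";
print sniff_xml_family('<?xml version="1.0" encoding="EUC-JP"?>'), "\n";
```

Once the family is known, the declaration itself can be read in that family's byte layout to find the exact encoding name, without parsing the rest of the document.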

          • Then again, the XML specification and my instincts are in heavy disagreement. But this time it works out pretty well. It's a neat and interesting approach, but of course this is yet another XML thing that's nice in theory but hard to implement correctly.

            I think I'll just keep passing XML around as I get it, without modifying it before it reaches the parser.

            So far, I haven't used XML with non-XML parsers.
            • Yeah, I wasn’t saying you should transcode the document (usually you shouldn’t) – just that if you find yourself needing to, you can do it.