http://svn.bulknews.net/repos/public/HTTP-Response-Charset/trunk/
So I created a module, HTTP::Response::Charset, which detects the charset of an HTTP response using various techniques (Content-Type, META tag, BOM, XML declaration and Encode::Detect). The motivation is to get a correctly decoded Unicode string from any HTTP response, especially text/html, text/plain, XHTML and RSS/Atom.
The POD has most of what I'd like to say, so go ahead and take a look at it. Also see the unit test suite, written with Test::Base, for the expected behavior.
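To give a rough idea of what such a detection chain can look like, here is a minimal sketch using only core Perl. It is not the module's actual code: guess_charset is a hypothetical helper, and it covers just three of the techniques listed above (Content-Type parameter, BOM, XML declaration), leaving out the META tag and Encode::Detect steps.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's actual code: guess a charset name
# from a Content-Type header value and the raw (undecoded) body bytes.
sub guess_charset {
    my ($content_type, $bytes) = @_;

    # 1. An explicit charset= parameter in the Content-Type header.
    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([\w.:-]+)"?/i) {
        return $1;
    }

    # 2. A byte order mark at the very start of the body.
    #    The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 ones,
    #    because the UTF-32LE mark begins with the UTF-16LE mark.
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;

    # 3. An encoding pseudo-attribute in an ASCII-compatible XML declaration.
    if ($bytes =~ /\A<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][\w.-]*)["']/) {
        return $1;
    }

    # Nothing found: return undef in scalar context, so the caller can
    # fall back to something else (a statistical detector, for example).
    return;
}
```

Calling guess_charset('text/plain', $bytes) on a body that starts with the three-byte UTF-8 BOM would return 'UTF-8', for example.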
After I wrote this code I searched Google Code Search for a bit and found that HTTP::Message, the base class of HTTP::Response, already has the very decoded_content() method that I wanted to implement on top of the charset() value. Ugh.
Fortunately or unfortunately, the current decoded_content is slightly different from what I wanted to do. It first decodes the content body based on the Content-Encoding header, like gzip, deflate or quoted-printable. Then, if the Content-Type is text/*, it decodes the content using the charset value set in the header (or in a META tag if the response is HTML).
This might not be good enough for some corner cases:
1) If no charset= is set in either the Content-Type header or a META tag, it decodes the text as latin-1 by default and gives corrupted Unicode data. (You can avoid that by saying $res->decoded_content(default_charset => 'none'), though)
2) It does Unicode decoding only for text/* responses, which means that if the response is application/xhtml+xml or application/atom+xml, it doesn't.
(Note that I'm not saying this is a bug. For XML data you don't need to decode the text portion yourself, since most XML parsers detect the encoding when they process the XML declaration and set the UTF-8 flag internally)
Update: per Matts this is a bug. Should I send Gisle a patch to decode when Content-Type matches application/(*+)xml?
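For illustration, the media type check that such a patch would need might look like the sketch below. wants_charset_decoding is a hypothetical name, and this is only my reading of the application/(*+)xml idea, not code from HTTP::Message.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the media type is text/* or an XML type
# such as application/xml, application/atom+xml or application/xhtml+xml.
sub wants_charset_decoding {
    my ($content_type) = @_;
    return 0 unless defined $content_type;

    # Drop parameters such as "; charset=utf-8", then trim and lowercase.
    my ($type) = split /;/, $content_type;
    return 0 unless defined $type;    # empty header value
    $type =~ s/^\s+|\s+$//g;
    $type = lc $type;

    return 1 if $type =~ m{^text/};
    return 1 if $type =~ m{^application/(?:[^;\s]+\+)?xml$};
    return 0;
}
```

The second pattern accepts both plain application/xml and any application/something+xml type, while types like application/octet-stream still fall through.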
So I hope this module fills the gap. For 1), you can call
$res->decoded_content(charset => $res->charset)
to deal with an HTTP response that has no charset set. For 2), you can just say
Encode::decode($res->charset, $res->content);
for whatever MIME types you'd like to decode.
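To make the corruption in 1) concrete, here is a small core-only sketch. The byte string is made up for the example; it stands in for a response body that is really UTF-8 but arrived without any charset= parameter.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 bytes for "cafe" with an acute accent on the e (U+00E9), as they
# might arrive in a response body whose headers carry no charset= parameter.
my $bytes = "caf\xC3\xA9";

# Falling back to Latin-1 maps every byte to one character: 5 characters,
# with the accented letter turned into two junk characters.
my $as_latin1 = Encode::decode('ISO-8859-1', $bytes);

# Decoding with the charset the body really uses gives the 4 characters
# that were intended.
my $as_utf8 = Encode::decode('UTF-8', $bytes);

printf "latin-1: %d chars, utf-8: %d chars\n",
    length($as_latin1), length($as_utf8);
```

Both calls succeed without any error, which is why the wrong default goes unnoticed until the text is displayed.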
However, to correctly decode gzip-encoded XML data with a BOM, for instance, you need to write it this way:
my $content = Encode::decode($res->charset, $res->decoded_content(charset => 'none'));
which looks a little kludgy.
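The reason for the two steps is ordering: the Content-Encoding has to be undone first, and only then can the resulting bytes be decoded as characters. Here is a self-contained sketch of that order using core modules instead of a real response. The sample document is made up, and the charset name is hard-coded where the line above would use $res->charset.

```perl
use strict;
use warnings;
use Encode ();
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Build a sample body: UTF-16 XML with a BOM, then gzip-compressed,
# standing in for what a server might send with Content-Encoding: gzip.
my $xml   = qq{<?xml version="1.0" encoding="UTF-16"?><title>caf\x{E9}</title>};
my $utf16 = Encode::encode('UTF-16', $xml);    # Encode prepends a BOM here
my ($wire, $raw);
gzip(\$utf16 => \$wire) or die "gzip failed: $GzipError";

# Step 1: undo the Content-Encoding, leaving raw bytes in the body's charset.
gunzip(\$wire => \$raw) or die "gunzip failed: $GunzipError";

# Step 2: decode those bytes into a Perl character string. The 'UTF-16'
# decoder reads the BOM to pick the byte order and drops it from the result.
my $text = Encode::decode('UTF-16', $raw);

print $text eq $xml ? "round trip ok\n" : "mismatch\n";
```

Doing it the other way around cannot work, since the compressed bytes are not valid UTF-16 text in the first place.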
I'm not sure whether it's better to hack (or extend) the decoded_content method, or to add another convenience method that does the right thing.
Any feedback is welcome.
No, it is a bug... (Score:2)
This is a bug - for XML you're supposed to be able to declare the encoding in the protocol. It's part of the spec (but I'm too lazy to look it up right now).
Re: (Score:1)
Thanks!
Re: (Score:2)
Augh META (Score:1)
The META tag is a nasty hack. Originally, servers were supposed to parse the outgoing document, insert the given headers into the HTTP header, and drop the tag on the floor (yes, really). Instead, clients now parse the body and then retroactively pretend the META tags had been part of the HTTP header. That leads to various kinds of nastiness. The whole thing is ugly and nasty and painful.
Please only respect it when found in text/html content – in application/xhtml+xml you should never ever look at it.
Re: (Score:1)
Re: (Score:1)
In short, the rules for XML MIME types (m{application/(.*\+)?xml}) are that if the HTTP header says anything, it is authoritative; otherwise, the XML parser gets to decide from the byte stream. If you are planning to pre-decode XML content as a courtesy for people who may want to do something other than pass it to an XML parser, you should read the XML spec; it has a clear outline of the algorithm an XML parser uses to detect the encoding.
But if you do that, be aware that XML parsers will want to decode t
Re: (Score:1)
And I removed the META detection if MIME-type is application/xhtml+xml. Actually I could remove the entire META detection code, since it's already done in LWP::UserAgent (and LWP::Protocol) unless you call
$ua->parse_head(0) explicitly.
If you are planning to pre-dec
Re: (Score:1)
Oh. I was referring to RFC 2616 (HTTP/1.1) Section 3.7.1, which says:
XML is not text (Score:2)
While this binary data does have text in it, not all of it actually is. The text in the binary data is encoded, and the character set for this has to be given. Indeed, with the charset attribute in the Content-Type, or in the <?xml?> declaration.
You need to know the right encoding, and when it's not in the document itself, you need to pass it to your XML parser, so it knows how to handle the
Re: (Score:1)
Sure, I know. I maintain XML::Atom, some Encode modules and do I18N stuff every day with Perl 5.8 server side
And any parser that does know about Perl's text strings wouldn't really be XML compliant, because XML compliancy involves handling character encodings.
Agreed. As I said on the
Re: (Score:1)
Oh? Why?
A major point of the preamble as specified is that intermediaries that need to mung the content be able to safely and reliably change the preamble to reflect their modifications. (F.ex., you might want to send XML over a transport that isn’t 8-bit-safe, in which case you can transcode the document to UTF-7 and not have to parse and re-emit it as US-ASCII with entities.)
Other than
Re: (Score:2)
But I do think it can be summarized, so that I can avoid the explicit why-question: to modify correctly, you need to parse. You can't parse before parsing, it's an infinite loop.
Re: (Score:1)
An XML parser has to solve the same quandary: before it can parse the document, it has to decode it, but to decode it, it needs to know the encoding, which is specified in the preamble. Oops.
Well, no. The preamble is actually a highly restricted protocol that you can implement without having to parse more of the document. Read the XML spec – there is a very clear outline of the exact layout of the preamble in terms of the actual possible octet sequences. (It’s impressive how many niggles they
Re: (Score:2)
I think I'll just keep passing XML around as I get it, without modifying it before it reaches the parser.
So far, I haven't used XML with non-XML parsers.
Re: (Score:1)
Yeah, I wasn’t saying you should transcode the document (usually you shouldn’t) – just that if you find yourself needing to, you can do it.