So I created a module, HTTP::Response::Charset, which detects the charset of an HTTP response using various techniques (Content-Type, META tag, BOM, XML declaration and Encode::Detect). The motivation is to get a correctly decoded Unicode string from any HTTP response, especially text/html, text/plain, XHTML and RSS/Atom.
The POD document has most of what I'd like to say, so go ahead and take a look at it. Also see the unit test suite, which uses Test::Base, for the expected behavior.
After I wrote this code I poked around Google Code Search for a bit and found that HTTP::Message, the base class of HTTP::Response, already has the exact decoded_content() method that I wanted to implement using the charset() value. Ugh.
Fortunately or unfortunately, the current decoded_content is slightly different from what I wanted to do. decoded_content first decodes the content body based on the Content-Encoding header, like gzip, deflate or quoted-printable. Then, if the Content-Type is text/*, it decodes the content using the charset value set in the header (or in the META tag if the response is HTML).
This might not be good enough for some corner cases:
1) If there's no charset= set in either the Content-Type header or a META tag, it decodes the text as latin-1 by default and gives you corrupted Unicode data. (You can avoid that by saying $res->decoded_content(default_charset => 'none'), though)
2) It does Unicode decoding only for text/* responses, which means that if the response is application/xhtml+xml or application/atom+xml, it doesn't decode anything.
(Note that I'm not saying this is a bug. For XML data you don't need to decode the text portion yourself, since most XML parsers detect the encoding when they process the XML declaration and set the UTF-8 flag internally)
Update: per Matts this is a bug. Should I send a patch to Gisle to decode when Content-Type matches application/(*+)xml?
So I hope this module fills the gap. For 1), you can pass
$res->decoded_content(charset => $res->charset)
to deal with an HTTP response that has no charset set. For 2), you can decode the content yourself with Encode::decode and $res->charset, for whatever MIME types you'd like to decode.
However, to correctly decode gzip-encoded XML data with a BOM, for instance, you need to write it this way:
my $content = Encode::decode($res->charset, $res->decoded_content(charset => 'none'));
which looks a little kludgy.
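To spell out the two steps hidden in that one-liner, here is a minimal sketch. It assumes libwww-perl and IO::Compress::Gzip are installed, builds a fake gzip-compressed Atom response by hand, and hard-codes 'UTF-8' where real code would use the $res->charset value from HTTP::Response::Charset. The sample XML is made up for illustration.

```perl
use strict;
use warnings;
use Encode ();
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Illustrative data only: a tiny XML document with one non-ASCII character.
my $xml   = qq{<?xml version="1.0" encoding="utf-8"?>\n<title>caf\x{e9}</title>\n};
my $bytes = Encode::encode('UTF-8', $xml);
gzip(\$bytes => \my $gzipped) or die "gzip failed: $GzipError";

# A hand-built response standing in for what $ua->get($url) would return.
my $res = HTTP::Response->new(200, 'OK');
$res->header('Content-Type'     => 'application/atom+xml');
$res->header('Content-Encoding' => 'gzip');
$res->content($gzipped);

# Step 1: undo Content-Encoding only. charset => 'none' tells
# decoded_content to leave the charset decoding alone.
my $raw = $res->decoded_content(charset => 'none');

# Step 2: turn the raw bytes into a Unicode string.
# Assumption: with HTTP::Response::Charset loaded, this would be $res->charset.
my $charset = 'UTF-8';
my $content = Encode::decode($charset, $raw);
```

Keeping the steps separate makes it clear that decoded_content is used here only for the gzip part, and that the charset decision is left entirely to the detected value.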
I'm not sure whether it's a good thing to hack (or extend) the decoded_content method, or to add another convenience method that does the right thing.
Any feedback is welcome.