XML::LibXML has a really great HTML parser in it, and I'm using it to parse HTML emails. The only problem is that my email parser has already decoded any alternate encodings in the email (e.g. GB2312) down to UTF-8. Now when XML::LibXML sees these HTML documents, if they happen to have:
<META http-equiv="Content-Type" content="text/html; charset=GB2312">
in them, then the parser treats them as GB2312, even though the bytes are now UTF-8. Ugh. If I strip out the META tag it seems to treat them as Latin-1 or something else completely by default. It's all very strange.
And it took me HOURS to figure out this is what was happening. I eventually found out (this morning, after having worked on this until late in the night) that the only way to get XML::LibXML to always recognise the document as UTF-8 is to say that it's UTF-8 in the META tag. So I actually have to replace the META tag before the document even gets to XML::LibXML (which seems a bit like parsing it before parsing it, but at least this works). In the end I plumped for this horrible pre-processing regexp:
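my $meta = '<META http-equiv="Content-Type" content="text/html; charset=utf-8">';
# Replace an existing charset declaration; failing that, add one to <head>;
# failing that, create a <head> in front of <body>.
unless ( $in =~ s/<META\s[^>]*charset=[\w-]*[^>]*>/$meta/gi ) {
    unless ( $in =~ s/<head>/<head>$meta/i ) {
        $in =~ s/<body>/<head>$meta<\/head><body>/i;
    }
}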
I think there are probably more unless() blocks I need to add in there, but it has worked on all the emails I've tried it on so far.
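For what it's worth, here is a rough sketch of how the same rewrite might look as a subroutine with a couple of the extra cases filled in. The name force_utf8_meta and the two added cases (a <head> or <body> tag that carries attributes, and a bare fragment with neither tag) are guesses at what those extra unless() blocks could be, not something tested against real mail.

```perl
use strict;
use warnings;

# The declaration we want every document to carry.
my $META = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

# Hypothetical helper: returns a copy of $html that declares UTF-8.
sub force_utf8_meta {
    my ($html) = @_;

    # 1. Rewrite any META tag that already declares a charset.
    return $html if $html =~ s/<meta\s[^>]*charset=[^>]*>/$META/gi;

    # 2. Otherwise inject the tag just inside <head>, attributes allowed.
    return $html if $html =~ s/(<head(?:\s[^>]*)?>)/$1$META/i;

    # 3. No <head> at all: create one in front of <body>.
    return $html if $html =~ s/(<body(?:\s[^>]*)?>)/<head>$META<\/head>$1/i;

    # 4. Bare fragment (assumption): prepend a <head> and let the parser cope.
    return "<head>$META</head>$html";
}
```

Each step returns as soon as a substitution succeeds, which is the same fall-through order as the nested unless() blocks, just flattened.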
With one exception (of course). MS-HTML generated by MS-Word. This is the most horrible monstrosity you've ever seen. In the end I punted - if I can't parse it properly with XML::LibXML I resort to piping it through lynx -dump. That kinda works even for MS-HTML, and although it'll be slower than the in-process XML::LibXML parsing, it only runs when I can't parse it the fast way.
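The fallback described above could be sketched roughly like this. The subroutine names are made up for illustration, the lynx flags (-dump, -stdin, -force_html) are the ones I'd expect to need, and the input is assumed to be a string of UTF-8 bytes; treat it as an outline of the idea rather than the code actually in use.

```perl
use strict;
use warnings;
use IPC::Open2;

# Fast path: parse in-process. Dies if XML::LibXML is missing or the parse fails.
sub text_via_libxml {
    my ($html) = @_;
    require XML::LibXML;
    my $doc = XML::LibXML->new->parse_html_string($html);
    return $doc->documentElement->textContent;
}

# Slow path: pipe the HTML through an external lynx process.
sub text_via_lynx {
    my ($html) = @_;
    my $pid = open2( my $out, my $in, 'lynx', '-dump', '-stdin', '-force_html' );
    print {$in} $html;
    close $in;                              # lynx renders once stdin is closed
    my $text = do { local $/; <$out> };     # slurp everything lynx printed
    waitpid $pid, 0;
    die "lynx exited with status $?\n" if $?;
    return $text;
}

# Try the fast parser first; only pay for lynx when that fails.
# $fast and $slow are optional code refs so the strategy can be swapped.
sub html_to_text {
    my ( $html, $fast, $slow ) = @_;
    $fast ||= \&text_via_libxml;
    $slow ||= \&text_via_lynx;
    my $text = eval { $fast->($html) };
    return $text if defined $text;
    return $slow->($html);
}
```

Calling html_to_text($in) with no extra arguments uses XML::LibXML and then lynx; passing code refs makes it easy to try a different fallback.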
Yes, this is why we should have had XML in the first place. Wish I could go back and fix history. *sigh*.
tidy (Score:2, Informative)
XML is doomed (Score:1)
Re:XML is doomed (Score:1)
If more people would use XSLT then that would improve the situation a lot, since it can only output valid XML (in most situations).
These people who are
Re:XML is doomed (Score:2)
Probably Perl, or PHP or ASP. And not using tools, just using strings.