All the Perl that's Practical to Extract and Report



I work for MessageLabs [messagelabs.com] in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.

Journal of Matts (1087)

Friday February 28, 2003
06:52 AM

Parsing HTML headaches...

[ #10834 ]

XML::LibXML has a really great HTML parser in it, and I'm using it to parse HTML emails. The only problem is my email parser has already decoded any alternate encodings in the email (e.g. GB2312) down to UTF-8. Now when XML::LibXML sees these HTML documents if they happen to have:

<META http-equiv="Content-Type" content="text/html; charset=GB2312">

in them, then the parser treats them as GB2312. Ugh. If I strip out the META tag it seems to treat them as Latin-1 or something else entirely by default. It's all very strange.
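For what it's worth, the charset sniffing can be approximated in a few lines of core Perl. This is just a sketch of the behaviour, not libxml2's actual detection code (which is C, and more thorough):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Roughly what the parser's charset sniffing does: find a Content-Type
# META tag and pull out the charset parameter.
sub sniff_charset {
    my ($html) = @_;
    return $1 if $html =~ /<META\s[^>]*charset=([\w-]+)/i;
    return;    # no declaration; the parser falls back to its default
}

print sniff_charset(
    '<META http-equiv="Content-Type" content="text/html; charset=GB2312">'
), "\n";    # GB2312
```

So any declared charset wins over whatever encoding the bytes are actually in, which is exactly the problem with already-decoded email bodies.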

And it took me HOURS to figure out this is what was happening. I eventually found out (this morning, after having worked on it until late in the night) that the only way to get XML::LibXML to always recognise the content as UTF-8 is to say it's UTF-8 in the META tag. So I actually have to replace the META tag before the document even gets to XML::LibXML (which seems a bit like parsing it before parsing it, but at least it works). In the end I plumped for this horrible pre-processing regexp:

my $meta = '<META http-equiv="Content-Type" content="text/html; charset=utf-8">';
unless ( $in =~ s/<META\s[^>]*charset=[\w-]*[^>]*>/$meta/gi ) {
  unless ( $in =~ s/<head>/<head>$meta/i ) {
    $in =~ s/<body>/<head>$meta<\/head><body>/i;
  }
}

I think there's probably more unless() blocks I need to add in there, but it has worked on all the emails I've tried it on so far.
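A couple of obvious extra cases are a <head> tag that carries attributes, and tag soup with neither <head> nor <body>. A sketch of how the substitution chain might grow (the function name is mine, not from the original code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Force a utf-8 charset declaration into an HTML string before parsing.
# Same idea as above, with two extra fallbacks: <head> tags carrying
# attributes, and documents with neither <head> nor <body>.
sub force_utf8_meta {
    my ($in) = @_;
    my $meta = '<META http-equiv="Content-Type" content="text/html; charset=utf-8">';

    # Replace any charset-bearing META tag that's already there...
    return $in if $in =~ s/<META\s[^>]*charset=[\w-]*[^>]*>/$meta/gi;

    # ...or drop one in just after <head> (with or without attributes)...
    return $in if $in =~ s/(<head[^>]*>)/$1$meta/i;

    # ...or wrap a new <head> around it just before <body>...
    return $in if $in =~ s/(<body[^>]*>)/<head>$meta<\/head>$1/i;

    # ...or, for soup with neither tag, prepend it and hope.
    return $meta . $in;
}
```

The final fallback means the META tag always ends up in the document somewhere, which is all the parser seems to care about.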

With one exception (of course). MS-HTML generated by MS-Word. This is the most horrible monstrosity you've ever seen. In the end I punted - if I can't parse it properly with XML::LibXML I resort to piping it through lynx -dump. That kinda works even for MS-HTML, and although it'll be slower than the in-process XML::LibXML parsing, it only runs when I can't parse it the fast way.
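The dispatch itself can stay simple: eval the fast parse and only shell out when it dies. A sketch of that shape (render_html and lynx_dump are made-up names, not from the actual filter; the lynx flags are its standard -dump/-force_html options):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Try the fast in-process parse; only fall back to the slow external
# dump when it dies. Parser and dumper are passed as coderefs so the
# dispatch stays testable without lynx installed.
sub render_html {
    my ($html, $fast_parser, $slow_dumper) = @_;
    my $text = eval { $fast_parser->($html) };
    return $text unless $@;          # fast path succeeded
    return $slow_dumper->($html);    # e.g. lynx -dump
}

# One way to drive lynx: write the HTML to a temp file and dump it.
# -force_html makes lynx render the file as HTML regardless of its name.
sub lynx_dump {
    my ($html) = @_;
    my ($fh, $file) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print $fh $html;
    close $fh or die "close: $!";
    my $text = `lynx -dump -force_html $file`;
    die "lynx failed: $?" if $?;
    return $text;
}
```

Since the slow path is a coderef, swapping lynx for tidy or anything else later is a one-line change.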

Yes, this is why we should have had XML in the first place. Wish I could go back and fix history. *sigh*.

  • tidy (Score:2, Informative)

    Tidy [sourceforge.net] has an option to clean up Word HTML [sourceforge.net] which might be handy, especially now that there are Perl bindings [rcn.com].
  • The more popular XML gets, the more it becomes like HTML. RSS is the most end-user-facing XML application, and the validity of generated RSS is so bad that a fair number of people seem to have started writing non-XML parsers to read it and accept anything...
    • Well, it doesn't help when you have something like XHTML, which is supposed to be a gateway drug to XML somehow, except that people write their XHTML in non-validating editors, so the vast majority of XHTML out there isn't XHTML at all; and if it's not XML, then it really is pointless to bother. Which is why I support the "XHTML considered harmful" gang.

      If more people would use XSLT then that would improve the situation a lot, since it can only output valid XML (in most situations).

      • These people who are outputting bad RSS ... what tools are they using to create it?

        Probably Perl, or PHP or ASP. And not using tools, just using strings.