NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.
All the Perl that's Practical to Extract and Report
Stories, comments, journals, and other submissions on use Perl; are Copyright 1998-2006, their respective owners.
Why not XML::LibXML in HTML mode? (Score:2)
Re:Why not XML::LibXML in HTML mode? (Score:1)
libxml2’s HTML mode is lenient, but not very lenient. It’s not that hard to make it choke. For processing your own stuff (or for generally markup-sparse things like weblog posts or comments or such) it’s fine, but out there on the open web it doesn’t cut it.
I prefer using HTMLTidy to beat things into shape, configured to give me XHTML, which I can then parse with a strict XML parser.
TagSoup also works. (Someone should port that one to Perl and/or C…)
Reply to This
Parent