All the Perl that's Practical to Extract and Report

  • I've used XML::LibXML in HTML mode. It'll be far faster, and when you wrap it with the xsh language, it's even better (and xsh version 2.0 is getting some very neat features).
    --
    • Randal L. Schwartz
    • Stonehenge
    • Yeah, XML::LibXML in HTML mode would work, too.

      I just picked HTML::TreeBuilder::XPath because I thought it'd be more relaxed about handling non-well-balanced HTML. XML::LibXML::Parser says "HTML (strict) documents" and that makes me a little nervous :)
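
For reference, a minimal sketch of the XML::LibXML HTML-mode approach discussed above. The sample markup and the helper name extract_links are made up for illustration, and the recover setting is an assumption about how lenient you want the parser to be, not something anyone in this thread prescribed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse possibly sloppy HTML and return all link targets.
sub extract_links {
    my ($html) = @_;

    # recover => 2 asks libxml2 to carry on after parse errors without
    # printing them; drop it to see how strict HTML mode is by default.
    my $doc = XML::LibXML->load_html(
        string  => $html,
        recover => 2,
    );

    # HTML mode yields an ordinary DOM, so plain XPath works on it.
    return map { $_->getAttribute('href') } $doc->findnodes('//a[@href]');
}

# Made-up input: unclosed <p> and <b>, no <html> or <body> wrapper.
my $html = '<p>See <a href="http://example.com/">this <b>link</a>'
         . '<p>and <a href="/two">another';

print "$_\n" for extract_links($html);
```

Whether this survives a given page from the open web is exactly the open question here; the sketch only shows the mechanics.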
    • libxml2’s HTML mode is lenient, but not very lenient. It’s not that hard to make it choke. For processing your own stuff (or for generally markup-sparse things like weblog posts or comments or such) it’s fine, but out there on the open web it doesn’t cut it.

      I prefer using HTMLTidy to beat things into shape, configured to give me XHTML, which I can then parse with a strict XML parser.

      TagSoup also works. (Someone should port that one to Perl and/or C…)
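
A rough sketch of the "tidy first, then parse strictly" pipeline described above, assuming the HTML::Tidy module from CPAN as the way to call Tidy from Perl (the comment does not say how Tidy was invoked). The helper name parse_via_tidy, the sample input, and the particular Tidy options are illustrative assumptions.

```perl
use strict;
use warnings;
use HTML::Tidy;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical helper: beat tag soup into XHTML, then parse it as strict XML.
sub parse_via_tidy {
    my ($html) = @_;

    my $tidy = HTML::Tidy->new({
        output_xhtml     => 1,  # emit XHTML instead of HTML
        numeric_entities => 1,  # &nbsp; becomes &#160;, so no DTD is needed
        tidy_mark        => 0,  # skip the generator <meta> element
    });
    my $xhtml = $tidy->clean($html);

    # Strict parse: no recovery mode, and never fetch the XHTML DTD.
    return XML::LibXML->load_xml(
        string       => $xhtml,
        load_ext_dtd => 0,
        no_network   => 1,
    );
}

# Made-up input: unclosed elements, unquoted attribute, named entity.
my $doc = parse_via_tidy('<p>First<p>Second &nbsp; <a href=/x>link');

# Tidy's XHTML output is in the XHTML namespace, so XPath needs a prefix.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');
print $_->getAttribute('href'), "\n" for $xpc->findnodes('//x:a[@href]');
```

The point of the split is that all the leniency lives in Tidy; if the XML parser still dies, the input was bad enough that you probably want to know about it.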