
All the Perl that's Practical to Extract and Report

  • My hammer of choice is forcing the input to XHTML using HTML Tidy, then attacking it with XPath. XPath rocks extremely hard. HTML::Tidy (there’s Andy Lester again) and XML::LibXML are excellent tools for this approach.

    • by grantm (164) on 2005.09.16 15:56 (#43330)
      Do you really need to tidy first? I just make sure recover mode is turned on and use LibXML's parse_html_string method (or parse_html_file) - works for me. Maybe you have to deal with uglier HTML than I do.
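
A minimal sketch of the approach described above, assuming XML::LibXML is installed. The sample markup, the variable names, and the XPath expression are made up for illustration; only `recover`, `suppress_errors`, `suppress_warnings`, `parse_html_string`, and `findnodes` come from XML::LibXML itself.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sloppy input: unclosed <li> and <p> tags, no <html> or <body>.
my $html = '<ul><li><a href="/a">First</a><li><a href="/b">Second</a></ul><p>trailing text';

# recover => 1 asks libxml2 to keep parsing after it hits errors;
# the suppress_* options stop it from printing those errors and warnings.
my $parser = XML::LibXML->new(
    recover           => 1,
    suppress_errors   => 1,
    suppress_warnings => 1,
);

# Parse the tag soup straight into a DOM, with no tidy pass beforehand.
my $doc = $parser->parse_html_string($html);

# Query the resulting tree with XPath: collect every href on a link in a list item.
my @hrefs = map { $_->value } $doc->findnodes('//li/a/@href');

print "$_\n" for @hrefs;
```

Whether this is enough depends on how broken the input is; very ugly markup may still be better served by a tidy pass first.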
      • I didn’t think of that because I actually use XSLT most of the time (nowadays via a Perl wrapper script around XML::LibXSLT and the aforementioned modules). There’s something really strange going on with namespaces in a DOM built using libxml’s HTML parser: it causes odd misbehaviour in XSL transforms that I never figured out, despite hours of debugging fun. When I started out, I didn’t even have the option because I was in fact using libxslt’s xsltproc utility, a