All the Perl that's Practical to Extract and Report

  • My hammer of choice is forcing the input to XHTML using HTML Tidy, then attacking it with XPath. XPath rocks extremely hard. HTML::Tidy (there’s Andy Lester again) and XML::LibXML are excellent tools for this approach.

    • Do you really need to tidy first? I just make sure recover mode is turned on and use LibXML's parse_html method - works for me. Maybe you have to deal with uglier HTML than I do.
      • I didn’t think of that because I actually use XSLT most of the time (nowadays via a Perl wrapper script around XML::LibXSLT and the aforementioned modules), and there’s something strange going on with namespaces in a DOM built using libxml’s HTML parser. It causes misbehaviour in XSL transforms that I never figured out, despite hours of debugging fun. When I started out, I didn’t even have the option, because I was using libxslt’s xsltproc utility, and that doesn’t even have a way to parse HTML input.

        If you’re actually just parsing the input using libxml and then doing all the work in Perl, you can probably get away without turning things into XHTML first, you’re right.
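
For readers who want to try the recover-mode approach from this thread, here is a minimal sketch, not anyone's actual script. It assumes a reasonably recent XML::LibXML that provides load_html; the extract_links helper and the sample markup are made up for illustration. It parses sloppy HTML directly, without a tidy step, and pulls out link targets with XPath.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse possibly broken HTML with libxml's HTML
# parser in recover mode and return every href found on an <a> element.
sub extract_links {
    my ($html) = @_;

    my $doc = XML::LibXML->load_html(
        string  => $html,
        recover => 2,    # keep going on malformed markup, without warnings
    );

    # findnodes returns attribute nodes here; value gives the string.
    return map { $_->value } $doc->findnodes('//a/@href');
}

# Sample input (invented): an unclosed <p> and an unclosed <a>.
my $html = <<'HTML';
<html><body>
<p>Unclosed paragraph
<a href="http://example.com/one">one
<a href="http://example.com/two">two</a>
</body></html>
HTML

print "$_\n" for extract_links($html);
```

If the result is going to be fed to XSLT rather than handled in Perl, the namespace trouble described above may still apply, in which case running the input through HTML::Tidy first is probably the safer route.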