  • XPath (Score:3, Informative)

    by ziggy (25) on 2003.07.01 16:45 (#21640)
    "XPath is incredible."
    Yep. Wrap your brain around this hack:
    document(//a/@href[contains(., '.html')])/html/head/title
    In the context of an XSLT stylesheet (or something that provides the document() function to retrieve a document by name), this little tidbit finds all of the links containing .html in the href, fetches them, parses them, and returns the title of each page.

    A spider. In one expression.

    Assign that to a nodeset and reapply the expression, and you're going two levels out. (Or just nest the document() functions into something really contorted.)
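
    Outside XSLT there is no document() function, so the same idea takes a few more lines. Here is a minimal sketch in Perl, assuming XML::LibXML is installed. The %pages hash and the fetch_doc() and linked_titles() helpers are made up for illustration; a real spider would fetch each href over the network instead.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-ins for fetched pages; a real spider would fetch by URL.
my %pages = (
    'index.html' => '<html><head><title>Index</title></head><body>'
                  . '<a href="one.html">one</a> <a href="two.txt">two</a>'
                  . '</body></html>',
    'one.html'   => '<html><head><title>Page One</title></head><body/></html>',
);

# Stand-in for document(): parse a page by name, or return nothing.
sub fetch_doc {
    my ($name) = @_;
    return unless exists $pages{$name};
    return XML::LibXML->load_xml(string => $pages{$name});
}

# One hop of the spider: follow .html links and collect the titles.
sub linked_titles {
    my ($doc) = @_;
    my @titles;
    for my $href ($doc->findnodes(q{//a/@href[contains(., '.html')]})) {
        my $target = fetch_doc($href->value) or next;
        push @titles, $target->findvalue('/html/head/title');
    }
    return @titles;
}

my @titles = linked_titles(fetch_doc('index.html'));
print "$_\n" for @titles;
```

    Calling linked_titles() again on each fetched document would be the "two levels out" step.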

    "The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath."
    Did you try munging the HTML with tidy first? That works a decent amount of the time. (You can have tidy emit XML/XHTML if you don't want to deal with HTML parsers.)
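
    A minimal sketch of that tidy step, assuming the tidy command-line tool is on the PATH and XML::LibXML is installed. The tidy_to_dom() and hrefs() helpers are made-up names. Tidy's XHTML output lives in the XHTML namespace, so the XPath needs a registered prefix (x here, chosen arbitrarily).

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Run tidy on a file and parse its XHTML output (hypothetical helper).
sub tidy_to_dom {
    my ($file) = @_;
    # -asxml: emit well-formed XHTML; -numeric: numeric entities, so no DTD
    # is needed to resolve them; -q: suppress nonessential output.
    open(my $tidy, '-|', 'tidy', '-q', '-asxml', '-numeric', $file)
        or die "cannot run tidy: $!";
    my $xhtml = do { local $/; <$tidy> };
    close $tidy;    # tidy exits non-zero on warnings, so the status is not checked
    die "tidy produced no output for $file\n"
        unless defined $xhtml && length $xhtml;
    return XML::LibXML->load_xml(string => $xhtml, load_ext_dtd => 0);
}

# Return every href in a tidied (XHTML-namespaced) document.
sub hrefs {
    my ($doc) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');
    return map { $_->value } $xpc->findnodes('//x:a/@href');
}

# Usage: perl tidy_links.pl page.html
if (@ARGV) {
    print "$_\n" for hrefs(tidy_to_dom($ARGV[0]));
}
```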
    • Wow, tidy is great! Thanks for the tip!

      (Morbus, you getting this for Spidering Hacks? :-)


      • Dammit, this is the secret I was going to reveal under the heading "When XPath won't work (and how to make it work anyway)".
    • You can use TagSoup, my SAX parser for HTML. I also have a version of Saxon 6 packaged with TagSoup for XSLT-ing arbitrary HTML.
  • Nat -

    Don't forget to talk about XML::LibXSLT's ability to write and register XPath extension functions written in Perl. :-)

    Of course 1.53 has memory bugs, but if you get Matt's CVS copy, you can have Perl callbacks from XSLT. This is incredibly useful; say you want to access Apache request objects from XSLT, using closures, in a handler().

        $xslt->register_function($urn, 'get_request', sub { &get_request($self,@_) } );

    Write get_request() to handle arguments to an XPath function (which can b
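
    A self-contained sketch of the closure idea, assuming current XML::LibXML and XML::LibXSLT releases, where register_function() is documented as a class method. The $self hash stands in for real per-request state such as an Apache request object, and the urn:x-example:request namespace and the 'uri' field are invented for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Hypothetical stand-in for per-request state (e.g. an Apache request object).
my $self = { uri => '/index.html' };

# Hypothetical namespace URN for the extension function.
my $urn = 'urn:x-example:request';

# The closure captures $self, so the XPath function can reach request state.
# Plain string arguments from XPath arrive as ordinary Perl scalars.
XML::LibXSLT->register_function($urn, 'get_request', sub {
    my ($field) = @_;
    return $self->{$field} // '';
});

my $style_doc = XML::LibXML->load_xml(string => <<'XSL');
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:req="urn:x-example:request"
    exclude-result-prefixes="req">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="req:get_request('uri')"/>
  </xsl:template>
</xsl:stylesheet>
XSL

my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $result     = $stylesheet->transform(XML::LibXML->load_xml(string => '<doc/>'));
my $output     = $stylesheet->output_string($result);
print $output, "\n";
```

    Because the callback is a closure, nothing about the request has to be passed through the stylesheet itself; the XSLT only names the field it wants.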
  • See the parse_html_* methods in XML::LibXML.
    • I wasn't sufficiently clear in my original message. I was trying the parse_html_* methods in XML::LibXML and they were whining about broken HTML in the two pages I was playing with. So I said "screw it" and went back to parsing those with HTML::* modules.


      • Doh. HTML parsers that can't parse broken HTML aren't that useful :)

        Have you tried HTML::TreeBuilder with Class::XPath?
        • I haven't, but boy that's really cute. I was wondering the other day whether there were more general XPath modules available. You know, with a little optimization (the ability to search a tree once but have multiple possible XPath expressions and associated actions to run at each step), you could use XPath as the basis for your optimizer--write XPath expressions for the things to optimize.

          Ah yes, I've known about XPath for three days. Why wouldn't I assume I've had an original thought :-)


      • $parser->recover(1)
        Fixes that problem.
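
        A minimal sketch of recovery mode, assuming a current XML::LibXML, whose documentation names the parser method recover. The broken markup is made up for illustration. The parser may still print warnings to STDERR; the same documentation lists recover_silently for suppressing them.

```perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately sloppy HTML: unclosed <p>, <b> and <a>, plus a stray </i>.
my $html = '<html><head><title>Broken</title></head>'
         . '<body><p>one<p><b>two</i><a href="next.html">next</body></html>';

my $parser = XML::LibXML->new;
$parser->recover(1);    # keep parsing after errors instead of dying

my $doc   = $parser->parse_html_string($html);
my $title = $doc->findvalue('/html/head/title');
my @hrefs = map { $_->value } $doc->findnodes('//a/@href');
print "$title: @hrefs\n";
```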
        • Well, bollocks :-) I'd even seen that option in the manpage. This is what comes of doing your work at 3am, I guess ...