
All the Perl that's Practical to Extract and Report

  • XPath (Score:3, Informative)

    by ziggy (25) on 2003.07.01 16:45 (#21640) Journal
    XPath is incredible.
    Yep. Wrap your brain around this hack:
    document(//a/@href[contains(., '.html')])/html/head/title
    In the context of an XSLT stylesheet (or something that provides the document() function to retrieve a document by name), this little tidbit finds all of the links containing .html in the href, fetches them, parses them, and returns the title of each page.

    A spider. In one expression.

    Assign that to a nodeset and reapply the expression, and you're going two levels out. (Or just nest the document() functions into something really contorted.)
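
For anyone who wants to see the moving parts spelled out, here is a rough Python sketch of what that expression does. It is not XSLT and not the real document() function: `PAGES`, `fetch`, `html_links` and `spider_titles` are hypothetical names made up for illustration, the "web" is an in-memory dict of well-formed pages, and the `contains()` test is done in plain Python because the standard library's ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the web: page name -> well-formed XHTML text.
PAGES = {
    "index.html": "<html><head><title>Home</title></head><body>"
                  "<a href='a.html'>A</a><a href='pic.png'>P</a>"
                  "<a href='b.html'>B</a></body></html>",
    "a.html": "<html><head><title>Page A</title></head>"
              "<body><a href='c.html'>C</a></body></html>",
    "b.html": "<html><head><title>Page B</title></head><body/></html>",
    "c.html": "<html><head><title>Page C</title></head><body/></html>",
}

def fetch(name):
    """Hypothetical replacement for document(): parse a page by name."""
    return ET.fromstring(PAGES[name])

def html_links(root):
    """Roughly //a/@href[contains(., '.html')], filtered in Python."""
    return [a.get("href") for a in root.iter("a")
            if ".html" in a.get("href", "")]

def spider_titles(root, depth=1):
    """Titles of the pages `depth` links away from `root`."""
    titles = []
    for href in html_links(root):
        doc = fetch(href)
        if depth == 1:
            # Roughly document(...)/html/head/title; doc is the html element.
            title = doc.find("head/title")
            if title is not None:
                titles.append(title.text)
        else:
            # Reapply the same step to each fetched document.
            titles.extend(spider_titles(doc, depth - 1))
    return titles
```

Calling `spider_titles(fetch("index.html"))` goes one level out, and passing `depth=2` is the "reapply the expression" step. A real version would swap `fetch` for something that actually retrieves and parses pages.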

    The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath.
    Did you try munging the HTML with tidy first? That works a decent amount of the time. (You can have tidy emit XML/XHTML if you don't want to deal with HTML parsers.)
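
One thing to watch for once tidy has produced XHTML: the output normally puts every element in the XHTML namespace, so a bare path like head/title finds nothing. A minimal Python sketch of querying such output follows. It assumes tidy was run with something like `tidy -q -asxml -numeric`; the `TIDIED` string is a hand-written stand-in for that output, not something tidy actually produced, and `title_and_links` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map for the XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hand-written stand-in for tidy's XHTML output (an assumption).
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Spidering</title></head>
<body><p><a href="one.html">one</a> <a href="two.txt">two</a></p></body>
</html>"""

def title_and_links(xhtml_text):
    """Return the page title and every href found in tidied XHTML."""
    root = ET.fromstring(xhtml_text)
    # Paths need the namespace prefix on each element step.
    title = root.findtext("x:head/x:title", namespaces=XHTML_NS)
    hrefs = [a.get("href")
             for a in root.findall(".//x:a[@href]", XHTML_NS)]
    return title, hrefs
```

Calling `title_and_links(TIDIED)` returns the title text and the list of hrefs. The same prefix trick applies in XSLT: declare a prefix for the XHTML namespace and use it on each step of the path.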
    • Wow, tidy is great! Thanks for the tip!

      (Morbus, you getting this for Spidering Hacks? :-)


      • Dammit, this is the secret I was going to reveal under the heading "When XPath won't work (and how to make it work anyway)".
    • You can use TagSoup, my SAX parser for HTML. I also have a version of Saxon 6 packaged with TagSoup for XSLT-ing arbitrary HTML.