use Perl
All the Perl that's Practical to Extract and Report

Journal of gnat (29)

Tuesday July 01, 2003
04:25 PM

XPath

[ #13184 ]
One of the themes I heard consistently at YAPC was that every XML hacker who learned XPath said their life had changed. I played a little with XPath last night while writing the XML chapter for Cookbook 2ed, and I have to say--they're right. XPath is incredible.

For those of you who don't know, XPath provides a kind of structural regular expression for finding the bits of an XML parse tree that you're interested in. For example, here's the XPath for "the href attributes of a tags":

//a/@href

The double slash means "anywhere in the tree" (a single slash would mean you're giving a fully-qualified path to a node, starting from the root element) and @ indicates you're talking about an attribute and not an element.

Put in context, here's a program to print all the href attribute values of a tags:

#!/usr/bin/perl -w

use XML::LibXML;
use strict;

my $parser = XML::LibXML->new;
my $dom = $parser->parse_fh(\*DATA);
my @nodes = $dom->findnodes('//a/@href');

foreach my $node (@nodes) {
  print $node->value, "\n";
}

__END__
<doc>
<p>
<a href="http://www.perl.org/">The perl.org web site</a> is much
sexier than <a href="http://www.python.org/">that of Python</a>.
</p>
</doc>

That's so much easier than walking trees! Of course, an event-based parser could have handled this easily too. But when you have more complex requirements for things to extract, XPath even beats SAX for convenience.

For example, let's find the href attributes of a tags whose text contains "perl" (note that contains() is case-sensitive):

//a[contains(text(), "perl")]/@href
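Dropped into a variant of the earlier program (a minimal sketch against the same sample document; remember contains() is case-sensitive, so "perl" matches the lowercase link text but not "Python"):

```perl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $dom    = $parser->parse_fh(\*DATA);

# only the links whose text contains "perl"
foreach my $node ($dom->findnodes('//a[contains(text(), "perl")]/@href')) {
    print $node->value, "\n";   # matches the perl.org link, not python.org
}

__END__
<doc>
<p>
<a href="http://www.perl.org/">The perl.org web site</a> is much
sexier than <a href="http://www.python.org/">that of Python</a>.
</p>
</doc>
```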

We can even go back up the tree, for example to find the p elements containing links:

//a/ancestor::p

XPath expressions get hairy fast, like regular expressions:

//a[contains(text(), "perl")]/ancestor::p
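Run against the same sample document, that combined expression looks something like this (a minimal sketch):

```perl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;

my $dom = XML::LibXML->new->parse_fh(\*DATA);

# the p elements that contain a link whose text mentions "perl"
foreach my $p ($dom->findnodes('//a[contains(text(), "perl")]/ancestor::p')) {
    print $p->toString, "\n";
}

__END__
<doc>
<p>
<a href="http://www.perl.org/">The perl.org web site</a> is much
sexier than <a href="http://www.python.org/">that of Python</a>.
</p>
</doc>
```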

Like Perl 6 regular expressions, XPath expressions can be padded with whitespace (and, also like Perl 6 regexes, there are a few places where whitespace is significant):

//a
  [ contains( text(), "perl" ) ]
/ancestor::p
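You can verify that the padded form behaves identically — both spellings of the expression select the same node (a small sketch):

```perl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;

my $dom = XML::LibXML->new->parse_string(
    '<doc><p><a href="http://www.perl.org/">perl</a></p></doc>');

my $compact = '//a[contains(text(), "perl")]/ancestor::p';
my $padded  = '//a
      [ contains( text(), "perl" ) ]
    /ancestor::p';

my ($n1) = $dom->findnodes($compact);
my ($n2) = $dom->findnodes($padded);
print "same node\n" if $n1->isSameNode($n2);
```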

The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath. The two pages I tried to parse were the O'Reilly Catalog and the use.perl homepage, and both had badly-formed HTML (overlapping tags, etc.). The W3C HTML validator had conniptions, as did XML::LibXML. I guess for real-world (i.e., broken) HTML, you still have to use regexps and the parsers described in TorgoX's book.

--Nat

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • XPath (Score:3, Informative)

    by ziggy (25) on 2003.07.01 16:45 (#21640) Journal
    XPath is incredible.
    Yep. Wrap your brain around this hack:
    document(//a/@href[contains(., '.html')])/html/head/title
    In the context of an XSLT stylesheet (or something that provides the document() function to retrieve a document by name), this little tidbit finds all of the links containing .html in the href, fetches them, parses them, and returns the title of each page.

    A spider. In one expression.

    Assign that to a nodeset and reapply the expression, and you're going two levels out. (Or just nest the document() functions into something really contorted.)

    The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath.
    Did you try munging the HTML with tidy first? That works a decent amount of the time. (You can have tidy emit XML/XHTML if you don't want to deal with HTML parsers.)
    • Wow, tidy [sourceforge.net] is great! Thanks for the tip!

      (Morbus, you getting this for Spidering Hacks? :-)

      --Nat

      • Dammit, this is the secret I was going to reveal under the heading "When XPath won't work (and how to make it work anyway)".
    • You can use TagSoup (http://tagsoup.info), my SAX parser for HTML. I also have a version of Saxon 6 packaged with TagSoup for XSLT-ing arbitrary HTML.
  • Nat -

    Don't forget to talk about XML::LibXSLT's ability to write and register XPath extension functions written in Perl. :-)

Of course 1.53 has memory bugs, but if you get Matt's CVS copy, you can have Perl callbacks from XSLT. This is incredibly useful; say you want to access Apache request objects from XSLT, using closures, in a handler().

        $xslt->register_function($urn, 'get_request', sub { &get_request($self,@_) } );

    Write get_request() to handle arguments to an XPath function (which can b
  • See the parse_html_* [cpan.org] methods in LibXML.
I wasn't sufficiently clear in my original message. I was trying the parse_html_* methods in XML::LibXML and they were whining about broken HTML in the two pages I was playing with. So I said "screw it" and went back to parsing those with HTML::* modules.

      --Nat

      • Doh. HTML parsers that can't parse broken HTML aren't that useful :)

        Have you tried HTML::TreeBuilder with Class::XPath [cpan.org]?
        • I haven't, but boy that's really cute. I was wondering the other day whether there were more general XPath modules available. You know, with a little optimization (the ability to search a tree once but have multiple possible XPath expressions and associated actions to run at each step), you could use XPath as the basis for your optimizer--write XPath expressions for the things to optimize.

          Ah yes, I've known about XPath for three days. Why wouldn't I assume I've had an original thought :-)

          --Nat

$parser->recover(1)
        Fixes that problem.
        • Well, bollocks :-) I'd even seen that option in the manpage. This is what comes of doing your work at 3am, I guess ...

          Thanks!

          --Nat
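Tying the thread together, here's a minimal sketch of that parser-recovery fix (the option is spelled recover in XML::LibXML's docs; the overlapping tags below are just an illustration of the kind of breakage the post complains about):

```perl
#!/usr/bin/perl -w
use strict;
use XML::LibXML;

my $parser = XML::LibXML->new;
$parser->recover(1);    # carry on past well-formedness errors

# overlapping tags, as found on real-world pages
my $broken = '<p>so <b>very <i>broken</b></i> HTML</p>';

my $dom = $parser->parse_html_string($broken);
foreach my $node ($dom->findnodes('//b')) {
    print $node->textContent, "\n";
}
```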