For those of you who don't know, XPath provides a kind of structural regular expression for finding the bits of an XML parse tree that you're interested in. For example, here's the XPath for "the href attributes of a tags":
//a/@href
The double slash means "anywhere in the tree" (a single slash would mean you're giving a fully-qualified path to a node, starting from the root element) and @ indicates you're talking about an attribute and not an element.
Put in context, here's a program to print all the href attribute values of a tags:
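#!/usr/bin/perl -w
use strict;
use XML::LibXML;

# Parse the XML that follows __END__ into a DOM tree.
my $parser = XML::LibXML->new;
my $dom    = $parser->parse_fh(\*DATA);

# Each match is an attribute node; value() returns its string value.
my @nodes = $dom->findnodes('//a/@href');
foreach my $node (@nodes) {
    print $node->value, "\n";
}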
__END__
<doc>
<p>
<a href="http://www.perl.org/">The perl.org web site</a> is much
sexier than <a href="http://www.python.org/">that of Python</a>.
</p>
</doc>
That's so much easier than walking trees! Of course, an event-based parser could have handled this easily too. But when you have more complex requirements for things to extract, XPath even beats SAX for convenience.
For example, let's find the href attributes of a tags whose text contains "perl" (the match is case-sensitive):
//a[contains(text(), "perl")]/@href
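To see that filter in action, here is a minimal sketch that runs it against the same sample document as the program above. The document is inlined as a string (my choice, so the script stands alone), and the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Same sample document as before, inlined as a string.
my $xml = <<'XML';
<doc>
<p>
<a href="http://www.perl.org/">The perl.org web site</a> is much
sexier than <a href="http://www.python.org/">that of Python</a>.
</p>
</doc>
XML

my $dom = XML::LibXML->new->parse_string($xml);

# The predicate in [...] keeps only the a elements whose text contains
# "perl"; contains() is case-sensitive, so "Perl" would not match.
my @hrefs = map { $_->value }
    $dom->findnodes('//a[contains(text(), "perl")]/@href');

print "$_\n" for @hrefs;
```

Only the perl.org link survives the predicate, because "that of Python" does not contain the lowercase string "perl".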
We can even go back up the tree, for example to find the p elements containing links:
//a/ancestor::p
XPath expressions get hairy fast, like regular expressions:
//a[contains(text(), "perl")]/ancestor::p
Like Perl 6 regular expressions, XPath expressions can be padded with whitespace (and, as in Perl 6 regexes, there are a few places where whitespace is still significant):
//a
[ contains( text(), "perl" ) ]
/ancestor::p
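As a rough check that the padded form behaves like the one-line form, this sketch builds the expression line by line and runs it against the sample document. It assumes the underlying libxml2 accepts whitespace between XPath tokens, as the XPath grammar allows; the variable names are mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<doc>
<p>
<a href="http://www.perl.org/">The perl.org web site</a> is much
sexier than <a href="http://www.python.org/">that of Python</a>.
</p>
</doc>
XML

my $dom = XML::LibXML->new->parse_string($xml);

# Whitespace goes between tokens, never inside one (so not inside
# a name or in the middle of "//").
my $xpath = join "\n",
    '//a',
    '  [ contains( text(), "perl" ) ]',
    '  /ancestor::p';

# ancestor::p climbs from each matching a element to the p elements
# that enclose it.
my @paras = $dom->findnodes($xpath);

printf "%d match(es): %s\n", scalar(@paras),
    join ', ', map { $_->nodeName } @paras;
```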
The only beef I have about it is that I can't find a robust HTML parser that'll give me XPath. The two pages I tried to parse were the O'Reilly Catalog and the use.perl homepage, and both had badly-formed HTML (overlapping tags, etc.). The W3C HTML validator had conniptions, as did XML::LibXML. I guess for real-world (i.e., broken) HTML, you still have to use regexps and the parsers described in TorgoX's book.
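For what it's worth, libxml2 also ships an HTML parser with a recovery mode, which XML::LibXML exposes as parse_html_string. The following is a minimal sketch on a small, deliberately broken snippet that I made up; whether it survives any particular real-world page is something you would have to test.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample: overlapping <b>/<a> tags and unclosed <p> and <a>.
my $html = <<'HTML';
<html><body>
<p>See <b><a href="http://www.perl.org/">perl.org</b></a>
<p>and <a href="http://use.perl.org/">use Perl
</body></html>
HTML

# recover asks the parser to keep going after errors instead of dying;
# the suppress_* options keep its complaints off STDERR.
my $dom = XML::LibXML->new->parse_html_string(
    $html,
    { recover => 1, suppress_errors => 1, suppress_warnings => 1 },
);

# The HTML parser puts elements in no namespace, so plain names work.
my @hrefs = map { $_->value } $dom->findnodes('//a/@href');

print "$_\n" for @hrefs;
```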
--Nat
XPath (Score:3, Informative)
document() function to retrieve a document by name), this little tidbit finds all of the links containing .html in the href, fetches them, parses them, and returns the title of each page. A spider. In one expression.
Assign that to a nodeset and reapply the expression, and you're going two levels out. (Or just nest the document() functions into something really contorted.)
Did you try munging the HTML with tidy first? That works a decent amount of the time. (You can have tidy emit XML/XHTML if you don't want to deal with HTML parsers.)
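The tidy route might look roughly like the sketch below. It assumes a tidy binary on your PATH; the two helper names are hypothetical. One catch worth knowing: tidy's XHTML output puts every element in the XHTML namespace, so the XPath needs a registered prefix.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Run the external tidy program over a file and return XHTML text.
# -asxhtml asks for XHTML, -q cuts the chatter, and -n writes numeric
# entities so that no DTD is needed to resolve things like &nbsp;.
sub tidy_to_xhtml {
    my ($file) = @_;
    open my $fh, '-|', 'tidy', '-q', '-asxhtml', '-n',
        '--show-warnings', 'no', $file
        or die "cannot run tidy: $!";
    local $/;               # slurp all of tidy's output
    my $xhtml = <$fh>;
    close $fh;              # tidy exits non-zero on warnings; not checked here
    return $xhtml;
}

# Parse XHTML text and return the href values of all links.
sub hrefs_from_xhtml {
    my ($xhtml) = @_;
    my $dom = XML::LibXML->new(load_ext_dtd => 0, no_network => 1)
                         ->parse_string($xhtml);
    my $xpc = XML::LibXML::XPathContext->new($dom);
    # "x" is an arbitrary prefix bound to the XHTML namespace.
    $xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');
    return map { $_->value } $xpc->findnodes('//x:a/@href');
}
```

With a hypothetical page.html in the current directory, usage would be `print "$_\n" for hrefs_from_xhtml(tidy_to_xhtml('page.html'));`. If tidy decides the input is beyond repair it may emit nothing, in which case the parse step dies.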
Re:XPath (Score:2)
(Morbus, you getting this for Spidering Hacks? :-)
--Nat
Re:XPath (Score:1)
Re:XPath (Score:1)
Perl XPath functions (Score:1)
Don't forget to talk about XML::LibXSLT's ability to write and register XPath extension functions written in Perl.
Of course 1.53 has memory bugs, but if you get Matt's CVS copy, you can have Perl callbacks from XSLT. This is incredibly useful; say you want to access Apache req objects from XSLT, using closures, in a handler().
$xslt->register_function($urn, 'get_request', sub { &get_request($self,@_) } );
Write get_request() to handle arguments to an XPath function (which can b
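For anyone who wants to try this outside Apache, here is a stand-alone sketch of the same idea. The namespace URN, the my: prefix, and the %request hash (standing in for a real request object) are all made up for illustration; only register_function itself comes from XML::LibXSLT.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

my $urn = 'urn:example:perl-functions';    # hypothetical namespace URI

# Stand-in for per-request state; the closure below captures it.
my %request = (uri => '/index.xml');       # hypothetical data

# Register get_request() so stylesheets can call it as an XPath function.
XML::LibXSLT->register_function($urn, 'get_request', sub {
    my ($key) = @_;     # XPath string arguments arrive as Perl strings
    return exists $request{$key} ? $request{$key} : '';
});

# The stylesheet binds the same URN to a prefix and calls the function.
my $style_doc = XML::LibXML->new->parse_string(<<'XSL');
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="urn:example:perl-functions"
    exclude-result-prefixes="my">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="my:get_request('uri')"/>
  </xsl:template>
</xsl:stylesheet>
XSL

my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style_doc);
my $result     = $stylesheet->transform(
    XML::LibXML->new->parse_string('<doc/>'));
my $output     = $stylesheet->output_string($result);

print $output, "\n";
```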
Re:Perl XPath functions (Score:2)
--Nat
LibXML and HTML (Score:2)
Re:LibXML and HTML (Score:2)
--Nat
Re:LibXML and HTML (Score:2)
Have you tried HTML::TreeBuilder with Class::XPath [cpan.org]?
Re:LibXML and HTML (Score:2)
Ah yes, I've known about XPath for three days. Why wouldn't I assume I've had an original thought :-)
--Nat
Re:LibXML and HTML (Score:2)
Re:LibXML and HTML (Score:2)
Thanks!
--Nat