Here is more example code that demonstrates how to use XPath and CSS selectors to do screen scraping without resorting to nasty regular expressions.
The task is: "Search search.cpan.org for XML and extract 1) how many modules match and 2) the links to the PODs, along with the module names."
There you go:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;
binmode STDOUT, ":utf8";
my $mech = WWW::Mechanize->new;
$mech->get("http://search.cpan.org/search?query=XML&mode=all");
my $count = $mech->xpath(q|//div[@class='t4']/small/b[3]|);
print "Count: ", $count->content->[0], "\n";
my @links = $mech->selector("p > a:first-child");
for (@links) {
    print "Module: ", $_->content->[0]->content->[0], "\n";
    print "Link: ", $_->attr('href'), "\n";
}
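# Translate a CSS selector to XPath and delegate to xpath() below.
sub WWW::Mechanize::selector {
    my($mech, $selector) = @_;
    $mech->xpath(HTML::Selector::XPath->new($selector)->to_xpath);
}
# Parse the current page and return the nodes matching an XPath expression.
sub WWW::Mechanize::xpath {
    my($mech, $xpath) = @_;
    # Decode using the charset from the Content-Type header, else assume UTF-8.
    my @ct = $mech->response->header('Content-Type');
    my $content;
    if ($ct[0] && $ct[0] =~ /charset=([\w\-]+)/) {
        $content = decode($1, $mech->content);
    } else {
        $content = decode_utf8($mech->content);
    }
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($content);
    $tree->eof;
    my @nodes = $tree->findnodes($xpath);
    # All matches in list context, only the first match in scalar context.
    return wantarray ? @nodes : $nodes[0];
}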
Pretty simple and maintainable, but a couple of things:
1) WWW::Mechanize::selector and
2) Using CSS selectors from HTML::TreeBuilder::XPath would be a win even if you don't use WWW::Mechanize. Probably 2 lines of code for a new module?
3) Guessing the charset from the HTTP response header could be split out into a separate module (HTTP::Response::GuessCharset?). This could be made more robust by reusing what we do in Plagger::Util::decode_content, which also detects the charset from a meta tag, from the XML declaration, and by using Encode::Detect.
4) I hate HTML::Element's content->[0] and content_list stuff. All I want is just "content of the children as HTML (or text)" and "attributes as a hash reference" ($elem->all_external_attr does this).
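To make point 3 a bit more concrete, here is a minimal sketch of that kind of charset guessing, using only the core Encode module. The function names guess_charset and decode_guessed are made up for this illustration; this is not the Plagger::Util::decode_content code, and it leaves out the Encode::Detect step entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: guess a charset name from, in order,
# the Content-Type header, an XML declaration, and a <meta> tag.
sub guess_charset {
    my ($content_type, $bytes) = @_;

    # 1) HTTP header, e.g. "text/html; charset=EUC-JP"
    if (defined $content_type && $content_type =~ /charset=["']?([\w\-]+)/i) {
        return $1;
    }

    # Only sniff the beginning of the document.
    my $head = substr($bytes, 0, 1024);

    # 2) XML declaration, e.g. <?xml version="1.0" encoding="Shift_JIS"?>
    if ($head =~ /^\s*<\?xml[^>]*\bencoding\s*=\s*["']([\w\-]+)["']/i) {
        return $1;
    }

    # 3) <meta http-equiv="Content-Type" content="...; charset=...">
    #    or <meta charset="...">
    if ($head =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w\-]+)/i) {
        return $1;
    }

    # Assumption: fall back to UTF-8 when nothing else is found.
    return 'utf-8';
}

# Hypothetical helper: decode raw bytes using the guessed charset.
sub decode_guessed {
    my ($content_type, $bytes) = @_;
    my $charset = guess_charset($content_type, $bytes);
    # Fall back to UTF-8 if Encode does not know the guessed name.
    my $enc = Encode::find_encoding($charset) || Encode::find_encoding('utf-8');
    return $enc->decode($bytes);
}
```

With WWW::Mechanize this could be called as decode_guessed(scalar $mech->response->header('Content-Type'), $mech->response->content), so that the guessing sees the raw bytes of the response.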
Excellent idea (Score:2)
If that's possible, I would be totally happy to include CSS selectors in HTML::TreeBuilder::XPath (and actually even in XML::XPathEngine). I would love the module to auto-detect which query language is used, but I don't think that's possible, as the syntaxes overlap.
mirod
Re: (Score:1)
It looks like CSS::SAC on CPAN has not been updated for a long time (the last update was in September 2004), so having a separate, pure-Perl implementation (independent of any other CPAN module) would not be a bad thing, though.
Re: (Score:2)
I hadn't looked at this at all, but I see that your HTML::Selector::XPath is indeed most of what's needed. Nice job.
I have to think about it, but at the very least I will add something to the docs about using HTML::Selector::XPath in order to use CSS selectors with the XML/HTML modules.
mirod
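For reference, the combination discussed in this thread might look roughly like the sketch below, without WWW::Mechanize in the picture. The helper name find_by_selector and the sample markup are made up for illustration; neither module provides such a function. It also uses as_text and all_external_attr from HTML::Element in place of the content->[0] chains.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of either module): run a CSS selector
# against a tree by translating it to XPath first.
sub find_by_selector {
    my ($tree, $selector) = @_;
    my $xpath = HTML::Selector::XPath->new($selector)->to_xpath;
    return $tree->findnodes($xpath);
}

# Made-up markup standing in for a search result page.
my $html = <<'HTML';
<html><body>
<p><a href="/dist/XML-Parser/">XML::Parser</a><br>A Perl module for parsing XML</p>
<p><a href="/dist/XML-Twig/">XML::Twig</a><br>Tree-style XML processing</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my @modules;
for my $link (find_by_selector($tree, "p > a:first-child")) {
    # as_text returns the text of the element and its children;
    # all_external_attr returns the attributes as a name/value list.
    my %attr = $link->all_external_attr;
    push @modules, { name => $link->as_text, href => $attr{href} };
}

print "Module: $_->{name}\nLink: $_->{href}\n" for @modules;

$tree->delete;   # free the tree once we are done with the nodes
```

The translation step is a single method call, so the helper itself stays at about two lines.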
amazing (Score:1)
I've been scraping HTML for a while (since sitescooper), and XPath is definitely the right way to do it, I think.