All the Perl that's Practical to Extract and Report

Journal of miyagawa (1653)

Tuesday October 03, 2006
04:56 AM

Screen scraping w/ XPath and CSS Selector in Action

[ #31204 ]

Here is more example code that demonstrates how to use XPath and CSS Selector to do screen scraping without using nasty regular expressions.

The task: search search.cpan.org for "XML" and extract 1) how many modules there are and 2) the links to the PODs, along with the module names.

There you go:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;
use WWW::Mechanize;
 
binmode STDOUT, ":utf8";
 
my $mech = WWW::Mechanize->new;
$mech->get("http://search.cpan.org/search?query=XML&mode=all");
 
my $count = $mech->xpath(q|//div[@class='t4']/small/b[3]|);
print "Count: ", $count->content->[0], "\n";
 
my @links = $mech->selector("p > a:first-child");
for (@links) {
    print "Module: ", $_->content->[0]->content->[0], "\n";
    print "Link: ", $_->attr('href'), "\n";
}
 
sub WWW::Mechanize::selector {
    my($mech, $selector) = @_;
    $mech->xpath(HTML::Selector::XPath->new($selector)->to_xpath);
}
 
sub WWW::Mechanize::xpath {
    my($mech, $xpath) = @_;
 
    my @ct = $mech->response->header('Content-Type');
 
    my $content;
    if ($ct[0] && $ct[0] =~ /charset=([\w\-]+)/) {
        $content = decode($1, $mech->content);
    } else {
        $content = decode_utf8($mech->content);
    }
 
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($content);
    $tree->eof;
 
    my @nodes = $tree->findnodes($xpath);
    return wantarray ? @nodes : $nodes[0];
}
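
To try the selector step without hitting the network, here is a minimal sketch that runs HTML::Selector::XPath against HTML::TreeBuilder::XPath directly, with no WWW::Mechanize involved. The helper name select_nodes and the sample HTML are made up for illustration; they are not part of either module. It uses HTML::Element's as_text to get the link text instead of chaining content->[0].

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Selector::XPath;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (an assumption, not an API of either module):
# translate a CSS selector to XPath and run it against a parsed tree.
sub select_nodes {
    my ($tree, $selector) = @_;
    my $xpath = HTML::Selector::XPath->new($selector)->to_xpath;
    return $tree->findnodes($xpath);
}

# Made-up sample markup, loosely shaped like a search result page.
my $html = <<'HTML';
<html><body>
<p><a href="/dist/XML-Simple/"><b>XML::Simple</b></a> <a href="/other">other</a></p>
<p><a href="/dist/XML-Twig/"><b>XML::Twig</b></a></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

for my $link (select_nodes($tree, "p > a:first-child")) {
    print "Module: ", $link->as_text, "\n";       # text of the element and its children
    print "Link: ", $link->attr('href'), "\n";
}

$tree->delete;    # free the tree once the nodes are no longer needed
```

The tree is deleted only after the loop because the returned nodes belong to it.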

Pretty simple and maintainable, but a couple of things:

1) WWW::Mechanize::selector and ::xpath would be pretty useful. The code monkey-patches them in, but it sounds better to create a WWW::Mechanize plugin that hooks in HTML::TreeBuilder(::XPath) -- UPDATE: it's now implemented as WWW::Mechanize::TreeBuilder on CPAN

2) Using CSS selectors with HTML::TreeBuilder::XPath would be a win even if you don't use WWW::Mechanize. Probably 2 lines of code for a new module?

3) Guessing the charset from the HTTP response header could be split out into a separate module (HTTP::Response::GuessCharset?). It could be made more robust with what we use in Plagger::Util::decode_content, which also detects the charset from the meta tag and the XML declaration, and by using Encode::Detect.

4) I hate HTML::Element's content->[0] and content_list stuff. All I want is just "content of the children as HTML (or text)" and "attributes as a hash reference" ($elem->all_external_attr does this).
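
Point 3 could be sketched roughly as below. This is only an illustration under stated assumptions: guess_charset and decode_body are hypothetical names, the order of the checks (header, then XML declaration, then meta tag) is my own choice, and the Encode::Detect step is left out. It is not how Plagger::Util::decode_content is actually written.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: pick a charset name for a response body.
# Candidates are tried in an assumed order: Content-Type header,
# XML declaration, meta tag.
sub guess_charset {
    my ($content_type, $bytes) = @_;
    my @candidates;

    push @candidates, $1
        if defined $content_type
        && $content_type =~ /charset=["']?([\w\-]+)/i;

    # Only inspect the start of the document.
    my $head = substr($bytes, 0, 1024);
    push @candidates, $1
        if $head =~ /^\s*<\?xml[^>]*\bencoding=["']([\w\-]+)["']/i;
    push @candidates, $1
        if $head =~ /<meta\b[^>]*\bcharset=["']?([\w\-]+)/i;

    # Return the first name that Encode actually knows about.
    for my $name (@candidates) {
        return $name if find_encoding($name);
    }
    return 'UTF-8';    # assumed default, in the spirit of the decode_utf8 fallback
}

# Hypothetical wrapper: decode raw bytes using the guessed charset.
sub decode_body {
    my ($content_type, $bytes) = @_;
    return decode(guess_charset($content_type, $bytes), $bytes);
}

# Possible use inside the xpath method, replacing the if/else there:
#   my $content = decode_body(
#       scalar $mech->response->header('Content-Type'),
#       $mech->content,
#   );
```

Checking each candidate with find_encoding means a bogus charset label falls through to the next source instead of making decode die.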

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • If that's possible, I would be totally happy to include CSS selectors in HTML::TreeBuilder::XPath (and actually even in XML::XPathEngine). I would love the module to auto-detect which query language is used, but I don't think that's possible, as the syntaxes overlap.

    --
    mirod
    • Where does CSS::SAC fit into this discussion? Thanks, Christopher
      • Hm, I haven't looked at CSS::SAC. Looks like it's a SAX-style parser for CSS? My code uses CSS selectors as just a replacement for XPath, and it could probably make use of a CSS selector parser to be complete.
      • Uh, I used Google Code Search to find the probably duplicated work done in CSS::SAC [google.com], in January 2005.

        Looks like CSS::SAC on CPAN has not been updated for a long time (the last update was September 2004), and having a separate, pure-Perl implementation (independent of any CPAN module) would not be a bad thing, though.
        • Indeed, I just thought I'd point it out, as I have been looking for something in Perl as good as ScrAPI. I don't have the cycles to write one and haven't found one yet (CSS::SAC being the closest). However, if we can build something better I am happy. :-) Christopher
    • That's totally possible with just a few lines of code, and yeah, auto-detecting selectors from xpath would be impossible. I'm not sure including the feature into H::TB::XPath is the right thing to do. Maybe it is.
      • I hadn't looked at this at all, but I see that your HTML::Selector::XPath is indeed most of what's needed. Nice job.

        I have to think about it, but at the very least I will add something in the docs about using HTML::Selector::XPath in order to use CSS selectors on XML/HTML modules.

        --
        mirod
  • This is very nice! Thanks!

    I've been scraping HTML for a while (since sitescooper), and XPath is definitely the right way to do it, I think.