
All the Perl that's Practical to Extract and Report


Journal of ziggy (25)

Tuesday May 20, 2003
04:29 PM

Python for Screen Scraping

[ #12335 ]
Jacek Artymiak posted some random Python code to parse the O'Reilly product index.

I've been writing screen scrapers off and on for years. I've read enough Python code to understand this program. Yet the style of this program really irks me. It's an example of something that works, yet is difficult to read. Without the brief introduction, and the URL embedded within the code, I would have no idea what this program did, and I probably wouldn't care. All I could identify from his program is that it examines an HTML page that contains <tr>, <td>, and <a> tags and lots of occurrences of the string http://www.oreilly.com/catalog/.

To prove to myself that I'm not being anti-Pythonic, I wrote a screen scraper to do something similar in Perl. It took me about ten minutes, mostly because HTML::TableContentParser is such a kickass module, and partly because I can never remember the format of the data it returns. :-) Most screen scrapers I write these days use Data::Dumper while in development to (a) remind me what HTML::TableContentParser returns, and (b) show me where the content I want to examine is stored.
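As an aside, here is a minimal sketch of that development habit. It parses a small made-up table and dumps the first row with Data::Dumper. The HTML snippet and the example.com URL are invented for illustration. The only things taken from the real program are the parse method and the rows, cells, and data keys it relies on.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TableContentParser;

## Hypothetical one-row table standing in for a real catalog page
my $html = <<'HTML';
<table>
<tr><td><a href="http://www.example.com/book/">Example Book</a></td><td>0-000-00000-0</td></tr>
</table>
HTML

## parse() returns a reference to a list of tables
my $tables = HTML::TableContentParser->new->parse($html);

## Keep the dump compact and in a stable key order
$Data::Dumper::Indent   = 1;
$Data::Dumper::Sortkeys = 1;

## Dump the first row of the last table to see where the cell data lives
print Dumper($tables->[-1]->{rows}->[0]);
```

Reading the dump once is usually enough to see which keys to follow when writing the real extraction loop.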

Here's my version. I've left the comments in, because that's how I wrote the code for myself. I think the intent of this program is much easier to divine based on the code shown below.

#!/usr/bin/perl5.8.0 -w

use strict;
use LWP::Simple;
use HTML::TableContentParser;

getstore("http://www.oreilly.com/catalog/prdindex.html", "prdindex.html")
    unless -e "prdindex.html";

open(my $index, "<", "prdindex.html")
    or die "Can't open prdindex.html: $!";
local $/;    ## Slurp mode: read the whole file in one go

my $p = HTML::TableContentParser->new;

## The catalog is the last table on the page
my $catalog = $p->parse(<$index>)->[-1]->{rows};

shift(@$catalog);    ## Remove the header row

my @fields = qw(title isbn price online_version examples);
my @books;

foreach my $row (@$catalog) {
    my %book;

    @book{@fields} = map {s/^\s+//; s/\s+$//; $_}
                        map {$_->{data}}
                            @{$row->{cells}};

    ## Clean up the data some more
    @book{qw(titleurl title)} = $book{title} =~ m/href="(.*?)">(.*?)</;

    ($book{examples}) = $book{examples} =~ m/href="(.*?)\s*"/;
    ($book{online_version}) = $book{online_version} =~ m/href="(.*?)"/;

    delete $book{examples} unless $book{examples};
    delete $book{online_version} unless $book{online_version};

    push(@books, \%book);
}
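To make the hash-slice step in the loop above easier to follow, here is a small sketch that runs the same cleanup on a single hand-built row. The row contents and the example.com URLs are made up. Only the field names and the regular expressions are taken from the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical row, shaped like one entry of the parser's {rows} list
my $row = {
    cells => [
        { data => ' <a href="http://www.example.com/book/">Example Book</a> ' },
        { data => ' 0-000-00000-0 ' },
        { data => ' $0.00 ' },
        { data => ' ' },
        { data => ' <a href="http://www.example.com/examples/ ">Examples</a> ' },
    ],
};

my @fields = qw(title isbn price online_version examples);
my %book;

## Hash slice: pair each field name with the trimmed contents of
## the cell in the same position
@book{@fields} = map { s/^\s+//; s/\s+$//; $_ }
                     map { $_->{data} }
                         @{ $row->{cells} };

## Split the title cell into its link target and its link text
@book{qw(titleurl title)} = $book{title} =~ m/href="(.*?)">(.*?)</;

## Keep only the link targets; a failed match leaves undef
($book{examples})       = $book{examples}       =~ m/href="(.*?)\s*"/;
($book{online_version}) = $book{online_version} =~ m/href="(.*?)"/;

## Drop the optional fields that turned out to be empty
delete $book{examples}       unless $book{examples};
delete $book{online_version} unless $book{online_version};

## Show what survived
foreach my $field (sort keys %book) {
    print "$field: $book{$field}\n";
}
```

The empty fourth cell has no link, so its match fails and online_version is deleted rather than stored as an undefined value.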
