

Journal of gnat (29)

Thursday April 24, 2003
09:33 PM

Screenscraping Tables

[ #11837 ]
The HTML::TableContentParser module (available from your local CPAN outlet) is awesome. I wanted to get a text file containing the data in the O'Reilly editorial calendar. Currently you can only access it through the web, and what you get back is either an HTML table or an Excel spreadsheet (hork!).

About an hour of hacking later, I had a program that emitted valid XML. The hard parts were, as always:

  • figuring out what pages I had to hit to get cookies and a session ID and all the other black magic required
  • finding the credentials method on a user agent (I can remember that there is a method for authentication, I just can never remember what it's called; there's a sketch after this list)
  • decoding the javascript to figure out what URL was actually being requested, and with what parameters
  • finding the right table and extracting info from it
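
The cookie-and-credentials part of that boils down to something like this rough sketch (the host, realm, and URLs below are placeholders, not the real intranet details):

    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use HTTP::Cookies;

    # placeholder host/realm/credentials; the real values come from
    # poking at the site with a browser and view-source
    my $ua = LWP::UserAgent->new;
    $ua->cookie_jar(HTTP::Cookies->new);          # hang on to the session cookie
    $ua->credentials('intranet.example.com:80',   # netloc is host:port
                     'Intranet',                  # auth realm
                     'USERNAME', 'PASSWORD');

    # hit the landing page first so the session ID gets set, then ask for
    # the page the javascript would have requested
    my $resp = $ua->get('http://intranet.example.com/edcal/');
    $resp    = $ua->get('http://intranet.example.com/edcal/report?format=html');
    die $resp->status_line unless $resp->is_success;
    print $resp->content;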

In the past, I'd have hacked the table by hand. But this time I elected to use HTML::TableContentParser, and it made the job a lot easier. The documentation's rather blurry on the data structure you get back, but I used Data::Dumper to display it and quickly figured out what I was working with.
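
The Dumper-then-walk step comes out looking roughly like this (the rows/cells/data keys are what a dump of the structure suggests; check your own output, since the docs are vague):

    #!/usr/bin/perl -w
    use strict;
    use HTML::TableContentParser;
    use Data::Dumper;

    # read the page saved earlier (filename is arbitrary)
    open my $fh, '<', 'edcal.html' or die "can't read edcal.html: $!";
    my $html = do { local $/; <$fh> };

    my $parser = HTML::TableContentParser->new;
    my $tables = $parser->parse($html);

    # dump it once to see what you're working with ...
    print Dumper($tables);

    # ... then walk it: an arrayref of tables, each with a 'rows' arrayref,
    # each row with a 'cells' arrayref, and the cell text under 'data'
    for my $table (@$tables) {
        for my $row (@{ $table->{rows} || [] }) {
            my @cells = map { defined $_->{data} ? $_->{data} : '' }
                        @{ $row->{cells} || [] };
            print join("\t", @cells), "\n";
        }
    }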

I've found that a lot of my screenscraping programs have the same structure. I quickly write the code that fetches the first page and saves it to a file. I look at it to visually confirm that I'm downloading the right page. Then I use Getopt::Std to implement an option that lets me say "don't download the first page, just load it from the local file". This speeds up debugging while I'm figuring out how to parse the HTML. When I was scraping the ORA proposals database last year, I had two or three steps that I could skip if I'd already debugged that part of the code.
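
In outline, that fetch-or-reuse-the-cached-copy skeleton looks something like this (the -c flag, URL, and cache filename are arbitrary placeholders):

    #!/usr/bin/perl -w
    use strict;
    use Getopt::Std;
    use LWP::UserAgent;

    # -c means "use the cached copy instead of hitting the site again"
    my %opt;
    getopts('c', \%opt);

    my $cache = 'page1.html';                            # arbitrary cache filename
    my $url   = 'http://intranet.example.com/edcal/';    # placeholder URL
    my $html;

    if ($opt{c} && -e $cache) {
        # skip the slow download while debugging the parsing code
        open my $fh, '<', $cache or die "can't read $cache: $!";
        $html = do { local $/; <$fh> };
    }
    else {
        my $ua   = LWP::UserAgent->new;
        my $resp = $ua->get($url);
        die $resp->status_line unless $resp->is_success;
        $html = $resp->content;
        open my $fh, '>', $cache or die "can't write $cache: $!";
        print $fh $html;
    }

    # ... now parse $html ...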

--Nat

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Any chance of seeing this perl code?
    I could learn a great amount from it, I'm sure...
    • Sure! I suppose I should have done more hackery to automatically determine the credentials() arguments from the URL, but I couldn't be buggered :-)

      #!/usr/bin/perl -w

      use LWP;
      use HTML::TableContentParser;
      use Getopt::Std;
      use strict;

      # username and password for ORA intranet
      my ($USERNAME, $PASSWORD) = ('CHANGE', 'ME');

      # where to store files.  change this!
      my $DIR = ($^O eq "darwin") ? '/Users/gnat/Ora/Paperwork/edcal'
                             

  • Recipe? (Score:2, Insightful)

    This would make a nice Cookbook recipe...

    --
    (darren)
    • D'oh, good point. I can't believe I didn't think of that. Thanks, applied. :-)

      --Nat

      • There's another module, HTML::TableExtract, for parsing HTML tables. I have used it, and it is pretty nice. I haven't looked at HTML::TableContentParser, so I can't really compare yet.

        Also, look at WWW::Mechanize, which is really awesome for scraping web content. There is also WWW::Mechanize::Shell, for writing quick scripts to do this kinda stuff.
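
        Off the top of my head, the Mechanize plus TableExtract combination looks roughly like this (the URL and column headers are made up, just to show the shape of it):

          #!/usr/bin/perl -w
          use strict;
          use WWW::Mechanize;
          use HTML::TableExtract;

          # made-up URL and column headers for illustration
          my $mech = WWW::Mechanize->new;
          $mech->get('http://www.example.com/calendar.html');

          # pick out the table whose columns match these headers
          my $te = HTML::TableExtract->new(headers => ['Title', 'Editor', 'Date']);
          $te->parse($mech->content);

          for my $ts ($te->tables) {
              for my $row ($ts->rows) {
                  print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
              }
          }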

        Just some more info for you to chew on while you write that cookbook entry.

        /prakash

        • I spent a long time looking for data with column headings for HTML::TableExtract to work on. I finally found some census data [census.gov], but after half an hour of trying, I couldn't make H::TE grok the nested table headings. I finally gave up and just documented HTML::TableContentParser. Sorry!

          --Nat

  • Excel to XML (Score:3, Insightful)

    by darobin (1316) on 2003.04.29 10:41 (#19573)

    If the Excel version is usable, then you might want to try XML::SAXDriver::Excel at some point (or for any similar problem that involves surviving in an office with M$ users).

    --

    Robin Berjon [berjon.com]