miyagawa (1653)

http://bulknews.vox.com/

Journal of miyagawa (1653)

Tuesday May 08, 2007
10:07 PM

Web::Scraper is released, the Perl port of Scrapi.rb

[ #33222 ]

Today I've been thinking about what to talk about at YAPC::EU (and at OSCON, if they're short of Perl talks; I'm not sure), and ended up spending a few hours hacking on a web-content scraping module built around a Domain Specific Language.

With help from the guys on the IRC channel and from obra, who gave a nice talk about DSLs in Perl at YAPC::Asia, I whipped up a really small module, Web::Scraper.

This is basically a Perl port of Ruby's scrapi toolkit, and its API is intended to be similar to the Ruby one. So you can write a script that parses Twitter's friend list and extracts image URLs for them like this:

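use URI;
use Web::Scraper;

# Twitter user to scrape; defaults to "miyagawa" when no argument is given
my $nick = shift || "miyagawa";
my $uri  = URI->new("http://twitter.com/$nick");

my $twitter = scraper {
    # each matching contact link adds one entry to the "friends" list
    process 'a[rel="contact"]',
        'friends[]' => scraper {
            # link target and title from the <a>, image source from the <img>
            process 'a',   url => '@href', name => '@title';
            process 'img', src => '@src';
        };
    # return only the "friends" list
    result 'friends';
};

my $friends = $twitter->scrape($uri);

# dump the scraped structure to STDERR as YAML
use YAML;
warn Dump $friends;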
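
The Dump call above only prints the scraped structure. As a rough sketch of how the result might be consumed, the snippet below assumes that scrape returns a reference to an array of hashes keyed by url, name and src, which is what the process rules suggest. The sample data and the image_lines helper are made up for illustration; they are not part of Web::Scraper.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what $twitter->scrape($uri) returns:
# assumed to be an array reference of hashes keyed by url, name and src.
my $friends = [
    { url => 'http://twitter.com/alice', name => 'Alice', src => 'http://example.com/alice.png' },
    { url => 'http://twitter.com/bob',   name => 'Bob',   src => 'http://example.com/bob.png'   },
];

# Build one "name: image URL" line per friend (illustrative helper).
sub image_lines {
    my ($list) = @_;
    return map { "$_->{name}: $_->{src}" } @$list;
}

print "$_\n" for image_lines($friends);
```

If the real return value has a different shape, only the stand-in data and the hash keys used in image_lines would need to change.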

I haven't looked at any of scrapi.rb's internal code, but I looked at several examples on the web and confirmed that those scripts run with only slight modifications. The module is a very small amount of code, just 100 lines or so, with some fun Perl hacking using local(), goto and function prototypes.

It's still alpha quality and the API is likely to change a lot, but enjoy, and give me feedback!
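
For readers who have not seen this style of Perl before, here is a minimal sketch of how a block-based DSL can be put together from function prototypes and local(). It is not Web::Scraper's actual code, and goto is left out to keep it short. The names recipe, step and $current are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical mini-DSL: recipe { step '...'; } collects steps into a list.
# Not Web::Scraper's implementation, only the same ingredients in miniature.

our $current;    # the list being built; given a temporary value with local()

# The (&) prototype lets callers pass a bare block: recipe { ... };
sub recipe(&) {
    my ($block) = @_;
    my @steps;
    local $current = \@steps;    # visible to step() calls made inside the block;
                                 # the previous value is restored on exit
    $block->();
    return \@steps;
}

# The ($) prototype lets callers write: step 'text';
sub step($) {
    my ($text) = @_;
    die "step called outside recipe\n" unless $current;
    push @$current, $text;
    return;
}

my $tea = recipe {
    step 'boil water';
    step 'add tea leaves';
    step 'wait three minutes';
};

print join(', ', @$tea), "\n";
```

Because $current is set with local(), nested recipe blocks each get their own list, and calling step outside any block fails loudly instead of silently writing to a global.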
