Journal of miyagawa (1653)

Friday September 14, 2007
07:19 PM

Web::Scraper 0.14

[ #34457 ]

Web::Scraper 0.14 has been released with a couple of neat features.

First of all, I incorporated HTML::Tagset's linkElements hash into the '@attr' accessor of elements, so if you do this:

$s = scraper { process "a", "links[]" => '@href' };

because a@href is known to be a link attribute, the values are automatically converted to absolute URIs, using the URI of the scraped page as the base, even if the value of 'href' is relative.
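To make the conversion concrete, here is a minimal sketch of what resolving relative values against a base looks like, using only the URI module. The base URI and the href values are hypothetical, chosen just for illustration; this shows the general technique, not Web::Scraper's internals.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URI, standing in for the page being scraped.
my $base = URI->new("http://example.com/blog/index.html");

# href values as they might appear in the HTML source (made up).
my @hrefs = ("/about", "post/1.html", "http://other.example.org/");

# Resolve each value against the base; already-absolute URIs pass through unchanged.
my @links = map { URI->new_abs($_, $base)->as_string } @hrefs;

print "$_\n" for @links;
```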

Prior to 0.14 you had to write:

my $base = URI->new("");
my $s = scraper {
    process "a",
        "links[]" => sub { URI->new_abs($_->attr('href'), $base) };
};

but you don't need to do that anymore. The same thing happens for all attributes known to be links, like img@src, script@src etc. If you call $s->scrape(\$html) after retrieving the $html content from somewhere else, you can pass the base URI as the second parameter to scrape(), like:

my $mech = WWW::Mechanize->new;
my $s = scraper { ... };
$s->scrape(\$mech->content, $mech->uri);
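For a version that runs without fetching anything, here is a rough sketch that scrapes an HTML string and passes the base URI as the second argument. The HTML snippet, the base URI, and the "links"/"images" keys are all made up for the example.

```perl
use strict;
use warnings;
use URI;
use Web::Scraper;

# Hypothetical HTML, as if it had been retrieved by some other means.
my $html = '<html><body><a href="/about">About</a><img src="logo.png"></body></html>';

# Hypothetical base URI of the page the HTML came from.
my $base = URI->new("http://example.com/blog/");

my $s = scraper {
    process "a",   "links[]"  => '@href';   # a@href is a link attribute
    process "img", "images[]" => '@src';    # so is img@src
};

# The second argument is the base URI used to resolve relative values.
my $res = $s->scrape(\$html, $base);

print "$_\n" for @{ $res->{links} }, @{ $res->{images} };
```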

Note that if the HTML content has a 'base' tag, the conversion to absolute URIs might fail. In that case, you might want to use HTML::ResolveLink from CPAN to fix up the HTML before feeding it to Web::Scraper.
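A minimal sketch of that pre-processing step could look like the following. The base URI and HTML are hypothetical, and this only shows the basic resolve-then-scrape flow; check the HTML::ResolveLink documentation for how it treats 'base' tags.

```perl
use strict;
use warnings;
use HTML::ResolveLink;
use Web::Scraper;

# Hypothetical base URI and HTML for illustration.
my $base = "http://example.com/blog/";
my $html = '<html><body><a href="post/1.html">First post</a></body></html>';

# Rewrite relative links in the HTML to absolute URIs first.
my $resolver = HTML::ResolveLink->new(base => $base);
my $fixed    = $resolver->resolve($html);

# The links are already absolute, so no base URI is passed to scrape().
my $s   = scraper { process "a", "links[]" => '@href' };
my $res = $s->scrape(\$fixed);

print "$_\n" for @{ $res->{links} };
```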

Second, I added a handy shortcut 'HTML' and its alias 'RAW', to get the HTML data inside the matched tag. As seen in Web::Scraper hack #2, the text node inside script and style tags can't be retrieved using 'TEXT' because it's not technically text. The 'HTML' shortcut is basically a shortcut for $_->as_HTML, but it strips the outermost tag (the matched tag itself), so it's more useful.

So the code in hack #2 can now be as simple as:

my $s = scraper {
    process "script", "code" => 'RAW';
};
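Here is a small self-contained sketch comparing the shortcuts side by side. The HTML snippet and the "code", "inner" and "text" keys are made up for the example, and the exact whitespace or escaping of the returned markup may differ between versions.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical HTML snippet for illustration.
my $html = <<'HTML';
<html><head><script>var answer = 42;</script></head>
<body><p>Hello <b>world</b></p></body></html>
HTML

my $s = scraper {
    process "script", "code"  => 'RAW';   # contents of the script tag
    process "p",      "inner" => 'HTML';  # inner markup, without the <p> wrapper
    process "p",      "text"  => 'TEXT';  # text only, tags dropped
};

my $res = $s->scrape(\$html);

print "$_: $res->{$_}\n" for sort keys %$res;
```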
