

Journal of miyagawa (1653)

Monday September 03, 2007
11:59 AM

Web::Scraper hacks #1: Extract links linking to images

[ #34325 ]

I'm going to post some neat cookbook-style hacks using Web::Scraper on this journal. They'll eventually be incorporated into the module documentation, probably as Web::Scraper::Cookbook, but I'll post them here for now since it's easy to update and gives each hack a permalink.

The easiest way to keep up with these hacks would be to subscribe to the RSS feed of this journal, or look at my del.icio.us links tagged 'webscraper' (which has an RSS feed too).

Want to contribute your experience? Tag your links webscraper on del.icio.us so I can follow them.

Yesterday I played with "What cameraphone do they use?", which extracts photo files from blog sites, and used the following code to extract the image links.

use Web::Scraper;

# extract A links that have an IMG inside
my $s = scraper {
    process "a>img", "links[]" => sub { $_->parent->attr('href') };
};

With "a>img" CSS selector, you'll get 'img' tags that follows 'a' tags, then call $_->parent to get its parent tag to retrieve the 'href' attribute.

> echo '<a href="foo.jpg"><img src="bar.jpg"></a>' | scraper
scraper> process "a>img", "links[]" => sub { $_->parent->attr('href') }
scraper> y
---
links:
  - foo.jpg
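For reference, here's a minimal standalone version of the same hack as a plain script. (The example.com URL is just a placeholder; point it at whatever page you actually want to scrape.)

#!/usr/bin/env perl
use strict;
use warnings;
use URI;
use Web::Scraper;

# same rule as above: grab the href of every A tag that wraps an IMG
my $s = scraper {
    process "a>img", "links[]" => sub { $_->parent->attr('href') };
};

# placeholder URL -- replace with the page you want to scrape
my $res = $s->scrape(URI->new("http://example.com/"));
print "$_\n" for @{ $res->{links} || [] };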

To be more accurate, so that it won't pick up A links that don't actually link to .jpg files, you can write a slightly more complex XPath expression:

process q{//a[contains(@href,'.jpg')]/img},
  'links[]' => sub { $_->parent->attr('href') };

The contains() XPath function makes sure that the href attribute actually contains ".jpg" somewhere, so the rule won't pick up A tags that link to HTML files, etc.
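To see the filtering in action, here's a quick sketch that runs the XPath rule over an inline chunk of HTML (the file names are made up). scrape() also accepts HTML directly, here passed as a scalar ref, which is handy for little tests like this; only the .jpg link comes back.

use strict;
use warnings;
use Web::Scraper;

# two A>IMG pairs: one links to a .jpg, the other to an HTML page
my $html = <<'HTML';
<a href="photo.jpg"><img src="thumb1.jpg"></a>
<a href="page.html"><img src="thumb2.jpg"></a>
HTML

my $s = scraper {
    process q{//a[contains(@href,'.jpg')]/img},
        'links[]' => sub { $_->parent->attr('href') };
};

my $res = $s->scrape(\$html);
print "$_\n" for @{ $res->{links} || [] };   # prints "photo.jpg" only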

  • Note that part of the improved expressiveness of XPath over CSS is that you don’t need to match on the deepest node you are trying to match; you can always climb back up the tree, or use assertions:

    # Tree navigation:
    process '//a[contains(@href,".jpg")]/img/..', 'links[]' => '@href';

    # Any valid XPath is also valid in an assertion:
    process '//a[img][contains(@href,".jpg")]', 'links[]' => '@href';

    The assertion is clearly the more straightforward way to say what you want here. It says “match A tags that contain an IMG and whose href contains .jpg”.



    • Oh yes, and that's why I prefer CSS selectors!

      My brain doesn't have enough space to remember stuff like the complete XPath syntax that I rarely use. I guess I should just print out an XPath cheat sheet somewhere, though.

      Thanks for the "superior" XPath pointer anyway. That works and that's exactly why I keep the XPath support in Web::Scraper :)
      • Ah, hehe. For me I guess it’s much like with the dereferencing punctuation in Perl: it has a few consistent rules that compose cleanly. So it doesn’t take up any space in my head at all. To each his own. :-)

  • Hello,

    Just in case you don't know about this fantastic scraping tool: http://scrubyt.org/getting-started-with-scrubyt/

    I'm sure there are a lot of ideas in that application you could include in your module.

    Regards,
    Relipuj.
    • Yes, I've taken a look at it as well, and found the scrapi API easier to implement. Making the Web::Scraper backend fully OO and providing different DSL dialects on top of it is a big TODO :)
      • This was just a suggestion ;-) Being an occasional scripter (and probably a bad one), I realize it's a big thing to do (or probably I cannot realize it ;-).

        Personally I don't mind too much about the DSL and the OO interface. An imported function is perfectly OK.

        What I love about it is that you just give it hints about what you want ("APPLE M9801LL..." and "71.99" in the example given), and it guesses, correctly in general, what you want to extract...

        But now I'd guess that is a lot of work too.

        Your module
  • But after reading your slides, I got religion real quick on Web::Scraper. Even presented on it to my Perl Mongers group. Thanks!
    • Oh, that's awesome. Which Perl Mongers?
      • Purdue Perl Mongers in West Lafayette, IN. The sad part is that I got it working and was testing and writing the thing in the two hours before the meeting, so I don't really have my head around the syntax. The sadder part is that my machine became unstable after I left, so I couldn't SSH in and look at example code and demo it. The saddest part was that there were no new members because the promotion machine got bolloxed. The happy part is that I essentially get a do-over because of all that.