I'm going to post some neat Web::Scraper cookbook recipes on this journal. They'll eventually be incorporated into the module documentation as something like Web::Scraper::Cookbook, but I'll post them here for now since it's easy to update and to give a permalink to.
The easiest way to keep up with these hacks would be to subscribe to the RSS feed of this journal, or look at my del.icio.us links tagged 'webscraper' (which has an RSS feed too).
Want to contribute your experience? Tag your posts 'webscraper' on del.icio.us so I can follow them.
Yesterday I played with What cameraphone do they use? which extracts photo files from blog sites, and used the following code to extract image files.
use Web::Scraper;

# extract A links that have an IMG inside
my $s = scraper {
    process "a>img", "links[]" => sub { $_->parent->attr('href') };
};
With the "a>img" CSS selector, you get 'img' tags that are direct children of 'a' tags; you then call $_->parent to get the enclosing 'a' tag and retrieve its 'href' attribute.
> echo '<a href="foo.jpg"><img src="bar.jpg"></a>' | scraper
scraper> process "a>img", "links[]" => sub { $_->parent->attr('href') }
scraper> y
---
links:
- foo.jpg
To be more accurate, so that it won't pick up A links that actually don't link to
process q{//a[contains(@href,'.jpg')]/img},
'links[]' => sub { $_->parent->attr('href') };
The contains() XPath function makes sure that the href attribute actually contains ".jpg" somewhere, so it won't pick up an A tag linking to an HTML file, etc.
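Putting it together, here's a minimal self-contained sketch of the XPath version. The sample HTML (foo.jpg, page.html and the thumbnail names) is made up for illustration; it assumes Web::Scraper is installed.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample markup: one A links to a JPEG, the other to an HTML page.
my $html = <<'HTML';
<a href="foo.jpg"><img src="thumb1.jpg"></a>
<a href="page.html"><img src="thumb2.jpg"></a>
HTML

my $s = scraper {
    # Select IMG children of A tags whose href contains ".jpg",
    # then climb to the parent A to read its href.
    process q{//a[contains(@href,'.jpg')]/img},
        'links[]' => sub { $_->parent->attr('href') };
};

my $res = $s->scrape($html);

# Only the link whose href contains ".jpg" should be listed.
print "$_\n" for @{ $res->{links} || [] };
```

Note that the test is on the A tag's href, not the IMG's src, so the second link should be skipped even though its thumbnail is a .jpg file.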
Better living through superior XPath (Score:1)
Note that part of the improved expressiveness of XPath over CSS is that you don’t need to match on the deepest node you are trying to match; you can always climb back up the tree, or use assertions:
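A minimal sketch of what such an assertion might look like, assuming the same link-extraction task as in the post. The exact expression and the sample markup are illustrative assumptions.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample markup, for illustration only.
my $html = <<'HTML';
<a href="foo.jpg"><img src="thumb1.jpg"></a>
<a href="page.html"><img src="thumb2.jpg"></a>
<a href="bar.jpg">text-only link</a>
HTML

my $s = scraper {
    # Match the A element itself. The predicate asserts that it has an
    # IMG child and that its href contains ".jpg", so no ->parent is needed.
    process q{//a[img and contains(@href,'.jpg')]},
        'links[]' => sub { $_->attr('href') };
};

my $res = $s->scrape($html);
print "$_\n" for @{ $res->{links} || [] };
```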
The assertion is clearly the more straightforward way to say what you want here. It says “m
Re: (Score:2)
Oh yes, and that's why I prefer CSS selectors!
My brain doesn't have enough space to remember stuff like the complete XPath syntax, which I rarely use. I guess I should just print out an XPath cheat sheet and put it somewhere, though.
Thanks for the "superior" XPath pointer anyway. That works, and that's exactly why I keep the XPath support in Web::Scraper.
Re: (Score:1)
Ah, hehe. For me I guess it’s much like with the dereferencing punctuation in Perl: it has a few consistent rules that compose cleanly. So it doesn’t take up any space in my head at all. To each his own. :-)
scRUBYt! (Score:1)
Just in case you don't know about this fantastic scraping tool: http://scrubyt.org/getting-started-with-scrubyt/
I'm sure there are a lot of ideas in that application you could include in your module.
Regards,
Relipuj.
Re: (Score:1)
Personally I don't mind too much about the DSL and the OO interface. An imported function is perfectly OK.
What I love about it is that you just give it hints of what you want ("APPLE M9801LL..." and the "71.99" in the example given), and it guesses, generally correctly, what you want to extract...
But now I'd guess it is a lot of work too.
Your module