
All the Perl that's Practical to Extract and Report


acme (189)

http://www.astray.com/

Leon Brocard (aka acme) is an orange-loving Perl eurohacker with many varied contributions to the Perl community, including the GraphViz module on the CPAN. YAPC::Europe was all his fault. He is still looking for a Perl Monger group he can start which begins with the letter 'D'.

Journal of acme (189)

Thursday January 20, 2005
05:48 AM

Extracting books from recipes

[ #22801 ]
I did do a little research on large-scale spelling, but all the equations scared me. I'll look at it again in a bit. Instead, I spent a day trying to construct metadata about partially-structured data. OK, what I really did was try to extract book information from the recipes. This way I can link to Amazon using the Amazon Associates program, get 5% in referral fees and be a millionaire by the time I'm 21. *cough*

Now it really was quite hard. Some recipes contained ISBNs. These were fairly easy to extract, and then I could use Amazon Web Services to look up the book title and an image. However, the vast majority of book references were free-form. These were slightly harder. I ended up taking a random sample of a couple of hundred recipes and building a test suite of the correct book references. After a full day of heuristic building, my code came up with mostly correct results. I could then plug the title into AWS and get a related ISBN.
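To make the "fairly easy" ISBN case concrete, here is a minimal sketch of pulling ISBN-like strings out of free text. It is written in Python for illustration and is not the code I actually ran; the pattern `ISBN_RE` and the function `find_isbn_candidates` are hypothetical names, and the sketch assumes 10-character ISBNs introduced by the literal word "ISBN".

```python
import re

# Assumed pattern: the word "ISBN", optional punctuation, then nine digits
# and a final digit or X, with optional hyphens or spaces between them.
ISBN_RE = re.compile(r"ISBN[:\s-]*((?:\d[\s-]?){9}[\dXx])", re.IGNORECASE)


def find_isbn_candidates(text):
    """Return normalized 10-character ISBN candidates found in text."""
    candidates = []
    for match in ISBN_RE.finditer(text):
        # Strip separators and upper-case a trailing check character 'x'.
        candidates.append(re.sub(r"[\s-]", "", match.group(1)).upper())
    return candidates


if __name__ == "__main__":
    sample = "Adapted from a cookbook (ISBN 0-306-40615-2), page 12."
    print(find_isbn_candidates(sample))
```

The results are only candidates: the pattern checks shape, not the check digit, and it would truncate a 13-digit number to its first ten digits, so each match still needs validating before being sent to a lookup service.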

I really do like the Amazon Web Services. I hadn't played with them until now and they really do expose an awful lot of Amazon's database. Also, notice in the previous links that it hasn't always been the exact book, but sometimes a similar one in the same category. OK, so sometimes it goes a little freaky but mostly it works out. It was tough, but I'd say it was a day well spent. And I don't think it's too evil to mention related books.

Whoops, you probably want some Perl content. I used Business::ISBN to validate the ISBNs and Net::Amazon to do lots of AWS queries. Lingua::EN::NamedEntity does a similar thing to my heuristics, but mine are more closely tied to my data. Note that the module links go to kobesearch, as search.cpan.org has been almost down for the last week. Time for dimsum!
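For the curious, validating a 10-digit ISBN comes down to a weighted checksum. The sketch below shows that standard ISBN-10 rule in Python. It is an illustration only, not Business::ISBN's code or API, and the function name `is_valid_isbn10` is made up for the example.

```python
def is_valid_isbn10(isbn):
    """Check the ISBN-10 check digit, ignoring hyphens and spaces."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # 'X' stands for 10, allowed only as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # Weights run from 10 for the first character down to 1 for the last.
        total += (10 - position) * value
    # A valid ISBN-10 has a weighted sum divisible by 11.
    return total % 11 == 0


if __name__ == "__main__":
    for candidate in ("0-306-40615-2", "0-306-40615-3", "080442957X"):
        print(candidate, is_valid_isbn10(candidate))
```

A check like this only says the number is well-formed. Whether it belongs to a real book is still a question for a lookup against something like AWS.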
