
All the Perl that's Practical to Extract and Report


Thursday August 29, 2002
11:58 AM

milking 'em and stringing 'em

[ #7394 ]

Someone I know was complaining about a writer's conference search site that had a really awful search interface. Says I, "Oh, well, I can pull all those pages down, parse them, and then build you a little database you can search."

Whoa.

Getting the pages downloaded turned out to be a little tricky. I suspected that there was something more complicated going on when I couldn't just do a GET for the URL; having not read quite enough documentation, I didn't realize how easy it was to add cookie support to LWP::UserAgent, so I futzed around with wsnitch, trying to get it to build so I could watch the HTTP interaction. I got it to build, but then it was just segfaulting. Okay, I'll debug that. gdb doesn't work.

Sigh. I diddled around until I found and fixed the bugs in the Gentoo gdb install, found and fixed the wsnitch problem and had it all working. Just about that time I figured that I should google for LWP and cookies. D'oh. At least wsnitch and gdb are working.
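
For reference, the cookie support in question comes down to one constructor argument. What follows is a minimal sketch, not the script actually used: `fetch_page` is a hypothetical helper, and the URL in the comment is a placeholder, since the real site is not named here.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Passing an empty hash ref as cookie_jar makes LWP create an in-memory
# HTTP::Cookies object, so cookies set by one response are sent back
# on later requests made with the same agent.
my $ua = LWP::UserAgent->new( cookie_jar => {} );

# Hypothetical helper: fetch a page and return its HTML,
# or die with the HTTP status line.
sub fetch_page {
    my ($url) = @_;
    my $response = $ua->get($url);
    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Placeholder URL (an assumption, not the real search site):
# my $html = fetch_page('http://www.example.com/search?page=1');
```

Reusing the same `$ua` for every request is the point: a session cookie handed out on the first page then rides along on all the later ones.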

Anyway, once I got that far, I could read all the pages, but they were full of nasties: tables upon tables, unclosed font tags, and plenty more. I was able to match out some of the boilerplate, but getting at the data itself required HTML::TreeBuilder and lots of diddling in the debugger.
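
To give a flavor of that kind of parsing, here is a small sketch with HTML::TreeBuilder. The HTML snippet is invented for illustration (nested tables, unclosed font tags) and is not taken from the real pages; the conference names and cities are made up too.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample: a layout table wrapping a data table, with unclosed <font> tags.
my $html = <<'HTML';
<table><tr><td>
  <table>
    <tr><td><font size="2">Conference A</td><td>Boston</td></tr>
    <tr><td><font size="2">Conference B</td><td>Denver</td></tr>
  </table>
</td></tr></table>
HTML

# TreeBuilder repairs the tag soup into a tree that can be searched.
my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;
for my $tr ( $tree->look_down( _tag => 'tr' ) ) {
    # Skip layout rows, i.e. rows that merely wrap another table.
    next if $tr->look_down( _tag => 'table' );

    # Collect the whitespace-trimmed text of each cell in this row.
    my @cells = map { $_->as_trimmed_text } $tr->look_down( _tag => 'td' );
    push @rows, \@cells;
}

# Free the tree explicitly; HTML::Element trees contain circular references.
$tree->delete;

print join( ' | ', @$_ ), "\n" for @rows;
```

On real pages, finding the right `look_down` criteria is where the debugger time goes; calling `$tree->dump` shows the structure TreeBuilder actually built.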

Next is using DBI for the very first time to build and query a database. At least I've learned a lot more, anyway.
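
A first DBI pass might look something like the sketch below. Everything specific in it is an assumption: the entry does not say which database is involved, so this uses an in-memory SQLite database through DBD::SQLite, and the `conferences` table, its columns, and the sample rows are invented for illustration.

```perl
use strict;
use warnings;
use DBI;

# Assumption: SQLite via DBD::SQLite, in memory, so nothing touches the disk.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Hypothetical schema for the parsed conference listings.
$dbh->do('CREATE TABLE conferences (name TEXT, city TEXT)');

# Placeholders (?) let DBI handle quoting of the inserted values.
my $insert = $dbh->prepare('INSERT INTO conferences (name, city) VALUES (?, ?)');
$insert->execute(@$_)
    for ( [ 'Conference A', 'Boston' ], [ 'Conference B', 'Denver' ] );

# Query with a bound parameter; the result is an array ref of row array refs.
my $found = $dbh->selectall_arrayref(
    'SELECT name, city FROM conferences WHERE city = ? ORDER BY name',
    undef, 'Denver' );

print "$_->[0] ($_->[1])\n" for @$found;

$dbh->disconnect;
```

With `RaiseError` on, any failed statement dies with the driver's message, which saves checking return values by hand during early experiments.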
