About an hour of hacking later, I had a program that emitted valid XML. The hard parts were, as always:
In the past, I'd have hacked the table by hand. But this time I elected to use HTML::TableContentParser, and it made the job a lot easier. The documentation's rather blurry on the data structure you get back, but I used Data::Dumper to display it and quickly figured out what I was working with.
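A minimal sketch of that dump-it-and-look approach might go something like this. The sample table and its contents are invented for illustration and are not the page described above. The rows/cells/data keys follow the module's own synopsis, so it's worth checking the Dumper output against whatever version is installed.

```perl
use strict;
use warnings;
use HTML::TableContentParser;
use Data::Dumper;

# Hypothetical sample input; a real page would come from a download or a saved file.
my $html = <<'HTML';
<table>
<tr><td>Scraping 101</td><td>A. Hacker</td></tr>
<tr><td>XML for Fun</td><td>B. Coder</td></tr>
</table>
HTML

my $parser = HTML::TableContentParser->new();
my $tables = $parser->parse($html);    # array ref, one hash per table found

# Dump the whole structure first to see what actually came back.
$Data::Dumper::Indent = 1;
print Dumper($tables);

# Then walk it: each table has rows, each row has cells, each cell has data.
for my $table (@$tables) {
    for my $row (@{ $table->{rows} }) {
        my @text = map { defined $_->{data} ? $_->{data} : '' } @{ $row->{cells} };
        print join(' | ', @text), "\n";
    }
}
```

Once the Dumper output makes the shape clear, the first print can go and the loop becomes the start of the real extraction code.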
I've found that a lot of my screenscraping programs have the same structure. I quickly write the code that fetches the first page and saves it to a file. I look at it to visually confirm that I'm downloading the right page. Then I use Getopt::Std to implement an option that lets me say "don't download the first page, just load it from the local file". This speeds up debugging while I'm figuring out how to parse the HTML. When I was scraping the ORA proposals database last year, I had two or three steps that I could skip if I'd already debugged that part of the code.
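The fetch-once, then reload-from-disk structure could be sketched roughly as follows. This is an illustration under assumptions, not the original program: the -c flag name, the page.html cache file, and the get_page helper are all hypothetical, and HTTP::Tiny stands in for whichever HTTP client was actually used.

```perl
use strict;
use warnings;
use Getopt::Std;
use HTTP::Tiny;    # assumption: any HTTP client would do here

# Return the page's HTML, either from the local cache file or from the network.
sub get_page {
    my ($url, $cache_file, $use_cache) = @_;

    if ($use_cache) {
        # -c given: skip the download and read the saved copy instead
        open my $in, '<', $cache_file or die "Can't read $cache_file: $!";
        local $/;    # slurp the whole file in one read
        return scalar <$in>;
    }

    my $res = HTTP::Tiny->new->get($url);
    die "Fetch of $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};

    # Save a copy so that later runs can use -c
    open my $out, '>', $cache_file or die "Can't write $cache_file: $!";
    print {$out} $res->{content};
    close $out or die "Can't close $cache_file: $!";
    return $res->{content};
}

my %opts;
getopts('c', \%opts) or die "usage: $0 [-c] URL\n";

# Only do any work when a URL was given on the command line.
if (my $url = shift @ARGV) {
    my $html = get_page($url, 'page.html', $opts{c});
    print length($html), " bytes of HTML ready to parse\n";
}
```

Run it once with just the URL to download and save the page, then add -c on later runs to work from the saved file while debugging the parsing. A multi-step scrape could give each step its own flag and cache file in the same way.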
--Nat
any chance... (Score:2, Insightful)
I could learn a great amount from it, I'm sure...
Re:any chance... (Score:3, Insightful)
Recipe? (Score:2, Insightful)
This would make a nice Cookbook recipe...
(darren)
Re:Recipe? (Score:2)
--Nat
Re:Recipe? (Score:1)
Also, look at WWW::Mechanize, which is really awesome for scraping web content. There's also WWW::Mechanize::Shell, for writing quick scripts to do this kinda stuff.
Just some more info for you to chew on while you write that cookbook entry.
Re:Recipe? (Score:2)
--Nat
Excel to XML (Score:3, Insightful)
If the Excel file is usable, then you might want to try XML::SAXDriver::Excel at some point (or for a similar problem involving surviving in an office with M$ users).
-- Robin Berjon [berjon.com]