
All the Perl that's Practical to Extract and Report


Friday January 23, 2004
07:13 AM

Leave the HTML alone!

[ #16970 ]

I have a lot of web scrapers that pull down information and display it to me so I don't have to do a lot of pointing, clicking, and scrolling.

My Mac has not been on the network for a while, so when I tried out my scrapers the other day, most of them failed because their regular expressions no longer matched anything.

Debugging this while I am paying $6/hour for network access is more than I want to spend, so I have a new trick.

All of these scrapers cache a lot of intermediate results in a hidden directory---so much so that some may think I overdo it a bit. Despite that, I had never saved the original web page.

I thought I could just save the web pages from my browser, but saving them as "Web Page Complete" munges the HTML so that, when I look at a page offline, it finds all of the right supporting files on my computer. The HTML in "Web Page Complete" is not the HTML my regexen see.

I modified these programs to save the real HTML so I can look at it later, but now I am a bit sad because this used to be so easy, and now there are more layers between me and the plain source.
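A minimal sketch of that modification, assuming the scraper already has the response body in a variable (the subroutine, directory, and file names here are hypothetical, not from the original programs):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Save the raw, unmodified HTML alongside the other cached
# intermediate results so it can be inspected offline later.
sub save_raw_html {
    my( $cache_dir, $name, $html ) = @_;

    make_path( $cache_dir ) unless -d $cache_dir;

    my $file = File::Spec->catfile( $cache_dir, "$name.html" );
    open my $fh, '>', $file or die "Could not open $file: $!";
    print {$fh} $html;
    close $fh;

    return $file;
}
```

Each scraper would call something like `save_raw_html( '.cache', 'some_page', $html )` on the response body before any regexes run, so the exact bytes the server sent are always available for offline debugging.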

  • Save as HTML (Score:2, Insightful)

    If you want to save the original contents of web pages, instead of what browsers make of them, the simplest solution is to write a little script using LWP/LWP::Simple (getstore() is virtually ideal), or, if that's not smart enough, WWW::Mechanize.
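    A minimal example of that suggestion, assuming LWP::Simple is installed (the URL and output file name are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# getstore() fetches a URL and writes exactly the bytes the
# server sent to a file, with none of the rewriting that a
# browser's "Web Page Complete" save feature does.
my $url  = 'http://www.example.com/';
my $file = 'page.html';

my $status = getstore( $url, $file );
warn "Fetch failed with HTTP status $status\n"
    unless is_success( $status );
```

    getstore() returns the HTTP status code, so checking it with is_success() catches failed fetches before the scraper tries to match against an empty or error page.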