
All the Perl that's Practical to Extract and Report

  • by jordan (120) on 2003.11.14 18:01 (#25811) Homepage Journal
    That would spider a site, download all the pages, and change the links to point at your on-disk copy. Those were commonly used back when people browsed the web over slow 36K modem links, but I haven't used one in years and can't recall the names of any such programs. My naive Google searching for something like this hasn't immediately turned up anything. Anybody know of a good one? Seems like a fairly easy Perl program to write, actually.

    Also, I used to use Plucker for Palm devices, which does this but downloads the result to a Palm device. I say used to because my work gave me a fairly tricked-out Pocket PC to use. Bah! I really like the Palm better, but this Pocket PC is SOO much better than my old Palm in terms of sound and memory and display and wirelessness. I sadly gave my old Palm III to my daughter.

    • by ziggy (25) on 2003.11.14 18:11 (#25813) Journal
      Seems like a fairly easy perl program to write, actually.
      Actually, it's harder to write than you'd think. There are lots of edge cases to handle: not only do you need to fetch images and munge the <img ...> tags, but also all of the frames, iframes, CSS stylesheets, media files, and so on. Oh, and don't forget to rationalize all of the URLs. LWP can convert relative URLs to absolute ones for you, but you still need to find each of them and replace it with something relative on your filesystem.

      I tried it a few times. It's not the 30-minute hack it appears to be. You're probably better off using GNU wget instead of rewriting it from scratch in Perl.
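A minimal sketch of the URL-rationalization step described above (in Python rather than Perl, since that's easier to demo standalone): resolve every href/src against the page's base URL, then map it to an on-disk path. The `local_path` helper and the save layout are illustrative assumptions, not anyone's actual tool.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def local_path(url, base):
    """Resolve url against the page's base URL, then map it to a
    relative on-disk path (hypothetical layout: netloc/path)."""
    absolute = urljoin(base, url)
    parsed = urlparse(absolute)
    path = parsed.path.lstrip("/") or "index.html"
    return f"{parsed.netloc}/{path}"


class LinkCollector(HTMLParser):
    """Collect every href/src the way a mirroring spider would,
    already rewritten to local paths."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(local_path(value, self.base))


parser = LinkCollector("http://example.com/docs/index.html")
parser.feed('<a href="../faq.html">FAQ</a><img src="pics/logo.gif">')
print(parser.links)
# → ['example.com/faq.html', 'example.com/docs/pics/logo.gif']
```

Even this toy version shows the fiddly part: relative links, parent-directory links, and bare hostnames all have to collapse to one consistent on-disk scheme, and that's before frames, CSS `url(...)` references, and the rest.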

      • Hey, cool, I didn't know wget would convert links for you. Thanks!

        Actually, I can think of a good use for this. There are a lot of web-based docs I've used that don't come in the form of one big HTML file. It's sometimes slow to browse these from work. You get the idea...

        Hmmmm... Can I maintain a bunch of pages on my local hard drive compressed such that they will uncompress when I access them from a browser? I could run Apache on my desktop, sure, but how do I build something that would support

        • With a little (mod_perl) URL translation and content filtering, you could easily read some compressed files and present them uncompressed. I think there even used to be a browser that could gunzip the right sort of response.
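The decompress-on-access idea could look something like this (sketched in Python rather than mod_perl; `read_page` and the `.gz`-sibling convention are assumptions for illustration): keep the pages gzipped on disk and transparently gunzip them whenever the server reads them.

```python
import gzip
import os


def read_page(path):
    """Return page contents; if only a gzipped sibling exists on disk,
    transparently decompress it (what a content filter would do)."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    with gzip.open(path + ".gz", "rb") as f:  # decompress on the fly
        return f.read()


# Store a page compressed, then read it back uncompressed.
html = b"<html><body>cached docs</body></html>"
with gzip.open("page.html.gz", "wb") as f:
    f.write(html)
print(read_page("page.html") == html)
# → True
```

Hooked up behind Apache (via mod_perl, a CGI, or anything similar), the browser would never know the on-disk copy was compressed.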
      • Well, getting 80% of the problem done is about 30 minutes, give or take a day. Since I tend to stay away from sites with goofy features, that 80% of the problem is 99% of my experience, so I don't have to do the other 20% of the problem just to get the 1% benefit.

        In fact, I had to do this to convert some IE web archives so I could read them.
    • I have such programs on my computer, but I don't always get to use my computer.

      On MacOS X I use either SiteSucker or WebDevil.

      Last night, however, I was stuck on an Army computer.