  • Once you've got a link checker / site spider (type 3), how much work is it to make a general-purpose unconstrained spider (type 4)? Not much.

    It sounds like you need to handle the issue of senselessly hammering someone's webserver in the third part of the chapter. Perhaps all you need to do in the fourth part is outline what needs to be changed (as an exercise for the reader) and continue with the litany of reasons why you shouldn't do this unless you really know what you're doing?

    • Once you've got a link checker / site spider (type 3), how much work is it to make a general-purpose unconstrained spider (type 4)?

      Actually, I think it's more the other way round. An unconstrained spider is easy to write; it's adding the constraints that's harder ;) (see the sketch below for just how little an unconstrained one needs)

      --
      Robin Berjon [berjon.com]
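
      Just to make that concrete, here is a minimal sketch of an unconstrained spider using LWP::UserAgent and HTML::LinkExtor, starting from a made-up URL. It happily follows every link it sees, which is exactly why it needs constraints before being let loose:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTML::LinkExtor;
      use URI;

      my $ua    = LWP::UserAgent->new( agent => 'toy-spider/0.1' );
      my @queue = ('http://www.example.com/');   # made-up start page
      my %seen;

      while ( my $url = shift @queue ) {
          next if $seen{$url}++;
          my $resp = $ua->get($url);
          next unless $resp->is_success
                  and $resp->content_type eq 'text/html';

          # Queue every link on the page, absolutized against the response base.
          my $extor = HTML::LinkExtor->new( sub {
              my ( $tag, %attr ) = @_;
              return unless $tag eq 'a' and $attr{href};
              push @queue, URI->new_abs( $attr{href}, $resp->base )->canonical;
          } );
          $extor->parse( $resp->decoded_content );
      }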

      • Yes, adding the constraints is harder, but the one constraint that distinguishes type 3 from type 4 isn't hard at all. All you need to do is check that each URL starts with a particular prefix before you fetch it (sketched below).

        The more difficult constraints, such as avoiding infinite URL spaces and not hammering a server too hard, apply to type 3 just as much as type 4. But I guess you don't need to worry so much about them when you're running a type 3 program on a site that you control, since you'll be aware of any p
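
        To make the prefix check concrete, here is a minimal sketch; the $prefix value is made up, and the rest of the crawl loop is assumed to exist already:

        use URI;

        my $prefix = 'http://www.example.com/docs/';   # made-up site root

        sub within_site {
            my ($url) = @_;
            # Canonicalize so 'HTTP://WWW.Example.COM:80/docs/x' still matches.
            return index( URI->new($url)->canonical->as_string, $prefix ) == 0;
        }

        # ... then, inside the crawl loop:
        # next unless within_site($url);
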
        • True. Not hammering a site is rather simple: just sleep correctly (using LWP::RobotUA). Avoiding infinite URL spaces, on the other hand, is hard and can probably only rely on heuristics (don't request more than n docs from server foo, don't request URLs that are longer than n chars, etc.) or on the owner of the site having set up a robots.txt that you can read with WWW::RobotRules. (See the sketch below.)

          --
          Robin Berjon [berjon.com]
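
          Roughly what that looks like in code, as a minimal sketch: the limits, agent name and contact address are all made up, and LWP::RobotUA uses WWW::RobotRules internally to honour robots.txt and to sleep between requests to the same host.

          use strict;
          use warnings;
          use LWP::RobotUA;
          use URI;

          my $ua = LWP::RobotUA->new(
              agent => 'polite-spider/0.1',
              from  => 'me@example.com',     # contact address (made up)
          );
          $ua->delay(1);                     # wait at least one minute between requests to a host

          my $MAX_PER_HOST   = 500;          # don't request more than n docs per server
          my $MAX_URL_LENGTH = 200;          # don't request URLs longer than n chars
          my %docs_per_host;

          sub want_to_fetch {
              my ($url) = @_;
              my $uri = URI->new($url);
              return 0 unless $uri->scheme and $uri->scheme =~ /^https?$/;
              return 0 if length($url) > $MAX_URL_LENGTH;
              return 0 if ++$docs_per_host{ $uri->host } > $MAX_PER_HOST;
              return 1;
          }

          # ... in the crawl loop:
          # next unless want_to_fetch($url);
          # my $resp = $ua->get($url);       # RobotUA checks robots.txt and sleeps here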