Stories, comments, journals, and other submissions on use Perl; are Copyright 1998-2006, their respective owners.
Third order spiders (Score:2)
It sounds like you need to deal with the issue of senselessly hammering someone's webserver in the third part of the chapter. Perhaps all the fourth part needs to do is outline what would have to change (as an exercise for the reader) and then continue with the litany of reasons why you shouldn't do this unless you really know what you're doing?
Re:Third order spiders (Score:2)
Once you've got a link checker / site spider (type 3), how much work is it to make a general-purpose unconstrained spider (type 4)?
Actually, I think it's more the other way round. An unconstrained spider is easy to write, it's adding the constraints that's harder ;)
-- Robin Berjon [berjon.com]
Re:Third order spiders (Score:1)
The more difficult constraints, such as avoiding infinite URL spaces and not hammering a server too hard, apply to type 3 just as much as type 4. But I guess you don't need to worry so much about them when you're running a type 3 program on a site that you control, since you'll be aware of any problems (or should be) and can stop the program when they come up.
A type 3 program that's running on a site you don't control, however, seems to be just as troublesome as a type 4 program, so I'm not sure the distinction is all that useful.
Re:Third order spiders (Score:2)
True. Not hammering a site is fairly simple: just sleep between requests (using LWP::RobotUA). Avoiding infinite URL spaces, otoh, is hard and can probably only rely on heuristics (don't request more than n docs from server foo, don't request URLs that are longer than n chars, etc.) or on the owner of the site having set up a robots.txt that you can read with WWW::RobotRules.
-- Robin Berjon [berjon.com]
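For what it's worth, the two heuristics mentioned above (a per-server document cap and a URL length cap) could be sketched roughly like this. It's only an illustration, not code from anyone in this thread: the name should_fetch and both limit values are made up, and the host is extracted with a simple regex where a real spider would probably parse the URL with the URI module.

```perl
use strict;
use warnings;

# Hypothetical limits; pick values that suit the sites being crawled.
my $MAX_DOCS_PER_HOST = 500;
my $MAX_URL_LENGTH    = 256;

my %docs_seen;    # host => number of URLs accepted so far

# Returns 1 if the URL passes both heuristics (and counts it against
# its host's budget), 0 otherwise.
sub should_fetch {
    my ($url) = @_;

    # Heuristic 1: very long URLs often mean an infinite URL space.
    return 0 if length($url) > $MAX_URL_LENGTH;

    # Crude host extraction for http(s) URLs only (an assumption of
    # this sketch); anything else is rejected.
    my ($host) = $url =~ m{^https?://([^/:?#]+)}i
        or return 0;
    $host = lc $host;

    # Heuristic 2: don't take more than n docs from any one server.
    return 0 if ($docs_seen{$host} // 0) >= $MAX_DOCS_PER_HOST;

    $docs_seen{$host}++;
    return 1;
}
```

This only decides which URLs to request. Pacing the requests and honouring robots.txt would still be left to something like LWP::RobotUA, and neither cap can do more than bound the damage of an infinite URL space.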