
All the Perl that's Practical to Extract and Report


TorgoX (1933)

TorgoX
  sburkeNO@SPAMcpan.org
http://search.cpan.org/~sburke/

"He is beautiful like the retractility of the claws of birds of prey [...] and above all, like the chance meeting, on a dissecting table, of a sewing machine and an umbrella!" -- Lautréamont

Journal of TorgoX (1933)

Saturday March 16, 2002
04:37 AM

Arachnia!

[ #3586 ]
Dear Log,

I spent all day poking at writing a chapter on spiders.

I think there are roughly four kinds of LWP programs:

  1. Programs that get one object off the web and process it (e.g., save it, whatever).
  2. Programs that get one object off the web, and find everything it links to, and process (save, etc) those.
  3. Programs that get a page, process it, look at everything that it links to that's on the same host, and get and process all those, and everything that they link to, and everything that those link to, recursively.
  4. Programs that get a page, process it, look at everything that it links to wherever it is on the Web, and get and process all of those, and everything that they link to, and everything that those link to, recursively.
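The first kind is the simplest. A minimal sketch, using LWP::UserAgent; the URL and agent name here are placeholders, and "processing" is just printing the body:

```perl
#!/usr/bin/perl
# Type 1: get one object off the web and process it.
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new( agent => 'MyFetcher/1.0' );
my $url = 'http://www.example.com/';   # placeholder URL

my $response = $ua->get($url);
if ( $response->is_success ) {
    print $response->content;   # a real program might save or parse this
}
else {
    die "Couldn't get $url: ", $response->status_line, "\n";
}
```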

I constantly write programs that do the first two, but I don't generally call them "spiders". I generally reserve the term "spider" for the last two.

I can imagine writing a program that does the third (a single-site spider), and indeed I'm doing so for the chapter; I think it'll comprise the meat of the chapter, showing off LWP::RobotUA.
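A rough sketch of such a single-site spider, using LWP::RobotUA (which consults robots.txt and sleeps between requests) plus HTML::LinkExtor for link extraction. The start URL, agent name, and page cap are all illustrative:

```perl
#!/usr/bin/perl
# Type 3: a single-site spider built on LWP::RobotUA.
use strict;
use warnings;
use LWP::RobotUA;
use HTML::LinkExtor;
use URI;

my $ua = LWP::RobotUA->new( 'MySpider/1.0', 'me@example.com' );
$ua->delay( 10 / 60 );   # delay() is in minutes: wait ten seconds between hits

my $start = URI->new('http://www.example.com/');   # placeholder start URL
my @queue = ($start);
my %seen  = ( "$start" => 1 );
my $max_pages = 50;      # arbitrary cap, so a bug can't crawl forever

while ( @queue and $max_pages-- > 0 ) {
    my $url = shift @queue;
    my $response = $ua->get($url);
    next unless $response->is_success
        and $response->content_type eq 'text/html';

    print "Got $url\n";   # "process" the page here

    # With a base URL, HTML::LinkExtor hands back absolute URIs.
    my $extor = HTML::LinkExtor->new( undef, $url );
    $extor->parse( $response->content );
    for my $link ( $extor->links ) {
        my ( $tag, %attrs ) = @$link;
        next unless $tag eq 'a' and $attrs{href};
        my $abs = URI->new( $attrs{href} )->canonical;
        next unless $abs->scheme eq 'http'
            and $abs->host eq $start->host;    # the single-site constraint
        $abs->fragment(undef);                 # don't refetch for #anchors
        push @queue, $abs unless $seen{"$abs"}++;
    }
}
```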

But it seems that many people want to write a program like the fourth -- a freely-traversing spider -- and for them I'm trying to muster something more useful than just saying "DON'T! [endchapter]".

After spending much of the day watching the cursor blink ON and OFF and ON and OFF, I think that that section will be "Don't, because... [fifty good reasons]".

Because just about everyone I know who admins a server has had some numbskull's useless, aimless spider come and hammer their server senseless for no good reason. "I was just searching the whole Web for pages about Duran Duran, is all! How was I supposed to know your host would contain an infinite URL-space in its events calendar site?"

  • Once you've got a link checker / site spider (type 3), how much work is it to make a general-purpose unconstrained spider (type 4)? Not much.

    It sounds like you need to handle the issue of senselessly hammering someone's web server in the third part of the chapter. Perhaps all you need to do with the fourth part is outline what needs to be changed (as an exercise for the reader) and continue with the litany of reasons why you shouldn't do this unless you really know what you're doing?

    • Once you've got a link checker / site spider (type 3), how much work is it to make a general-purpose unconstrained spider (type 4)?

      Actually, I think it's more the other way round. An unconstrained spider is easy to write, it's adding the constraints that's harder ;)

      -- Robin Berjon [berjon.com]

      • Yes, adding the constraints is harder, but the one constraint that distinguishes type 3 from type 4 isn't hard at all. All you need to do is check that each URL starts with a particular prefix before you fetch it.
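That prefix check is a one-liner. A minimal sketch; the prefix and the function name `in_bounds` are made up for illustration:

```perl
use strict;
use warnings;

# The one constraint that turns a type-4 spider into a type-3 one:
# refuse any URL that doesn't start with our prefix.
my $prefix = 'http://www.example.com/';   # illustrative prefix

sub in_bounds {
    my ($url) = @_;
    return substr( $url, 0, length $prefix ) eq $prefix;
}

print in_bounds('http://www.example.com/events/today.html')
    ? "fetch\n" : "skip\n";    # prints "fetch"
print in_bounds('http://other.example.org/page.html')
    ? "fetch\n" : "skip\n";    # prints "skip"
```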

        The more difficult constraints, such as avoiding infinite URL spaces and not hammering a server too hard, apply to type 3 just as much as to type 4. But I guess you don't need to worry so much about them when you're running a type 3 program on a site that you control, since you'll be aware of any problems.
        • True. Not hammering a site is rather simple: just sleep correctly (using LWP::RobotUA). Avoiding infinite URL spaces, on the other hand, is hard, and can probably only rely on heuristics (don't request more than n docs from server foo, don't request URLs longer than n chars, etc.) or on the owner of the site having set up a robots.txt that you can read with WWW::RobotRules.

          -- Robin Berjon [berjon.com]
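WWW::RobotRules can be exercised without any network access by handing it the text of a robots.txt directly. A small sketch; the robots.txt content and URLs are invented for illustration:

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Parse a robots.txt and ask whether given URLs may be fetched.
my $rules = WWW::RobotRules->new('MySpider/1.0');
$rules->parse( 'http://www.example.com/robots.txt', <<'END' );
User-agent: *
Disallow: /calendar
END

print $rules->allowed('http://www.example.com/about.html')
    ? "allowed\n" : "disallowed\n";      # prints "allowed"
print $rules->allowed('http://www.example.com/calendar/2002/03')
    ? "allowed\n" : "disallowed\n";      # prints "disallowed"
```

In a real spider you'd fetch each host's /robots.txt once and cache the parsed rules (or just use LWP::RobotUA, which does this for you).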