use Perl
WWW::Search Module Advisory
hfb writes "Recently Google requested that a module be removed from CPAN as it violated their terms of use agreement. The author agreed to remove the module without a fuss, but those of you who have modules for other search engines, who are considering writing one or who write crawlers should be aware of this development. The search engines likely don't mind sensible and courteous crawlers but cannot abide the DoS-like crawling that happens with poorly written clients. WWW::Search namespace, this means you."

Agreement?
(Score:1, Insightful)( http://www.cantrell.org.uk/david | Last Journal: 2006.08.23 19:02 )
Re:Agreement?
(Score:4, Informative)( http://www.krellis.org/ | Last Journal: 2003.07.11 4:52 )
Guns don't kill people, people kill people
(Score:4, Insightful)
As long as the module CAN be used in accordance with their TOS (i.e., as long as I can use it to write a script which I use for Personal Use) then the module itself is not in violation. If they don't like the way some asshole is using the module, they should go after the asshole.
I can write a meta-searching site that violates their TOS using nothing but Apache, /bin/sh, and lynx -- does that mean lynx violates their TOS and should be pulled from circulation?
(Admittedly, I don't know the details of what module was pulled. Maybe it was called Apache::SearchGoogleWithYourOwnAds and did all the work for you to create a proxy to Google that showed their results with your ads -- but I doubt it could have been that bad.)
Re:Guns don't kill people, people kill people
(Score:4, Interesting)( Last Journal: 2002.02.25 17:47 )
I don't think they would have asked for its removal had it not been a pressing issue. I get at least one fucktard a day who wrote a crawler to use on the CPAN search engine that doesn't respect robots.txt or anything else for that matter, effectively crippling the service for everyone else. Idiots with a little Perl and not a lot of common sense can ruin your day.
The author removed the module voluntarily. However, the others in the namespace would do well to consider the applications of their modules and compliance with the terms of service to avoid this sort of problem in the future.
Search engines like to provide a useful service without the added hassle of someone trying to hoover their database with 50 queries a second or more. I consider abusive crawlers to be a menace and a threat to freely available search engines like Google, CPAN search and others.
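For readers wondering what "respecting robots.txt" looks like in a client, here is a minimal sketch in Python using the standard library's urllib.robotparser. The robots.txt body, the example.org URLs, and the "example-crawler" agent name are all made up for illustration; a real client would fetch the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real crawler would download this from the target site first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 2
"""

AGENT = "example-crawler"  # assumed agent name, not a real crawler


def build_parser(robots_body):
    """Parse a robots.txt body into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True only if robots.txt allows this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for url in ("http://example.org/search?q=perl", "http://example.org/about"):
        print(url, "allowed" if may_fetch(parser, AGENT, url) else "disallowed")
    print("requested delay between hits:", parser.crawl_delay(AGENT), "seconds")
```

The check costs one extra request per site, and the Crawl-delay value, where a site sets one, gives the client a number to throttle against instead of guessing.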
Re:Guns don't kill people, people kill people
(Score:4, Interesting)
I mean, if people are slamming their site with a module, taking the module off CPAN isn't going to stop them -- they've still got it, and they'll still use it.
There is definitely something to be said, however, for trying to make your modules play as nicely as possible -- having a section in the documentation on how to use the module responsibly is good, but module writers might also want to consider putting "safety valves" in their code that users have to go out of their way to open. That way you're doing your part to make your software play nice with the rest of the children, and you can point a clear finger at the user for disabling the safety feature.
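One way such a "safety valve" could look, sketched in Python rather than Perl: a throttle that enforces a minimum delay between requests and refuses to go faster unless the caller passes an explicit override. The class name, the override_safety flag, and the one-second floor are all assumptions made for this sketch; they are not taken from any WWW::Search module or from any search engine's stated limits.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between requests, with an opt-out the
    caller has to ask for by name."""

    # Assumed floor for illustration only; not any engine's published limit.
    SAFE_MIN_INTERVAL = 1.0  # seconds

    def __init__(self, min_interval=SAFE_MIN_INTERVAL, override_safety=False,
                 clock=time.monotonic, sleep=time.sleep):
        # The safety valve: going below the floor needs an explicit flag.
        if min_interval < self.SAFE_MIN_INTERVAL and not override_safety:
            raise ValueError(
                "min_interval below %.1fs requires override_safety=True"
                % self.SAFE_MIN_INTERVAL
            )
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has passed since the previous call.
        Returns the number of seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

A client would call wait() before each query. Anyone who wants 50 queries a second has to write override_safety=True in their own code, which is the "clear finger" part.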
I'm reminded of some code a buddy showed me a few years ago. There was YA Buffer Overflow hole in some software, and the person who found the hole had released a C program to exploit it (in the spirit of SATAN). If you compiled the code (or got a binary from someone) and used it without looking at the source to understand what it did, you would never notice the #ifdef SCRIPT_KIDDIE block that put the user's name, email, IP, hostname, and a bunch of other really useful information into the large string that was generated to overrun the buffer -- giving anyone who had patched the bug all the data they needed to track down the person trying to hack them.
Perhaps people writing modules in the WWW::Search hierarchy could put similar data into X- headers without documenting the "feature" so search engines can better block/track assholes abusing the module.
Hypocrisy.
(Score:2, Insightful)
Let's rephrase that: a remote program (a web browser, in a sense) visits their webpage, parses the data to keep only the URLs of webpages that Google obviously doesn't own, and uses only that information that Google doesn't own. So why would this be a problem? And how is this different from Google crawling people's web pages and caching their data and images?
This is not a DoS attack. You don't crawl google iteratively in parallel. It is a simple one page query.
You might argue that people can put some files in place to prevent search engines from indexing their pages. Don't forget, Google extracts copyrighted material from others' pages, and a metasearch script extracts only the data Google doesn't own at all.
What was the module in question?
(Score:2, Insightful)( http://www.lies.com/jbc/ | Last Journal: 2003.05.27 13:00 )
What about my terms of use?
(Score:2, Interesting)
The terms I stipulate are that anyone may access my homepage for any reason at any time by any means, automated or not, with the exception of Google or any of its agents. Google must pay a fee of $1 per hit for any access to any web page or graphic element on my domain [collinstarkweather.com], whether that access is through a browser, a spider, or a robot.
I'm looking forward to my first check!
-- My choice of computing platform is a symbol of my individuality and belief in personal freedom.
Re:What about my terms of use?
(Score:4, Insightful)( Last Journal: 2002.02.25 17:47 )
Grow up
Search engines are quite possibly the single most useful part of the internet. Try to spend a day without them. Google provides the service for free and largely advertisement-free as well. Unless you would prefer that Google and others resorted to subscription-only or ad-filled content, it is not an unreasonable request to consider the consequences and potential misuses of these modules.
Google also respects the ban in robots.txt if you don't wish them to index your site. A lot of other crawlers don't.
THE AUTHOR VOLUNTARILY OFFERED TO REMOVE IT AT GOOGLE'S REQUEST. NO PUPPIES WERE KILLED IN THE FILMING OF THIS MOVIE.
Googlewhacking
(Score:2, Funny)( http://www.nexcerpt.com/ | Last Journal: 2002.10.17 16:08 )
Disclaimer: the term "Googlewhack" was invented by my employer, Gary Stock [unblinking.com] (his home page is where it all started). At one point he was getting hundreds of emails a day, people offering up their clever googlewhacks, arguing about scoring and legitimacy, describing programming hacks, etc. The scariest was somebody who said he was coding up a script to walk through /usr/share/dict/words (or whatever) and submit all the word pairs to Google.
Jeff Boes Hyper-real techno priest of Perl
Middle solution?
(Score:2, Interesting)
Proxies?
(Score:2, Interesting)
I do all of my browsing through an ad-removing proxy. I suppose they could claim my proxy is also in violation of the TOS, since my software reformats every page that I see.
Sad, really.