
All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Agreement? (Score:1, Insightful)

    Call me an idiot, but I'm buggered if I can see any such agreement on Google's site. It ain't on the front page, and it ain't on the results page.
  • Assuming the issue here was the "Personal Use Only" and "No Automated Querying" sections, I really don't see how the existence of a module can violate Google's TOS (unless there's some automated test case that they take issue with).

    As long as the module CAN be used in accordance with their TOS (ie: as long as I can use it to write a script which I use for Personal Use) then the module itself is not in violation. If they don't like the way some asshole is using the Module, they should go after the asshole.

    I can write a meta-searching site that violates their TOS using nothing but Apache, /bin/sh, and lynx -- does that mean lynx violates their TOS and should be pulled from circulation?

    (Admittedly, I don't know the details of what module was pulled. Maybe it was called Apache::SearchGoogleWithYourOwnAds and did all the work for you to create a proxy to Google that showed their results with your ads -- but I doubt it could have been that bad.)

    • I don't think they would have asked for its removal had it not been a pressing issue. I get at least one fucktard a day who wrote a crawler to use on the CPAN search engine that doesn't respect robots.txt or anything else for that matter, effectively crippling the service for everyone else... idiots with a little Perl and not a lot of common sense can ruin your day.

      The author removed the module voluntarily. However, the others in the namespace would do well to consider the applications of their modules and compliance with the terms of service to avoid this sort of problem in the future.

      Search engines like to provide a useful service without the added hassle of someone trying to hoover their database with 50 queries a second or more. I consider abusive crawlers to be a menace and a threat to freely available search engines like google, CPAN search and others.
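
For anyone wondering what "respecting robots.txt" looks like in practice, here is a minimal sketch in Python (not the code of any module discussed here). The robots.txt content, the bot name "ExampleBot", and the example.com URLs are all made up for illustration; only the standard-library `urllib.robotparser` calls are real.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the
# site's own /robots.txt instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/search?q=perl", "http://example.com/about"):
        print(url, allowed(ROBOTS_TXT, "ExampleBot", url))
```

A crawler would call something like `allowed()` before every request and simply skip any URL the site has asked robots to stay away from.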

      • I still say they should be going after the users, not the code.

        I mean, if people are slamming their site with a module, taking the module off CPAN isn't going to stop them -- they've still got it, and they'll still use it.

        There is definitely something to be said, however, for trying to make your modules play as nicely as possible. Having a section in the documentation on how to use the module responsibly is good, but module writers might also want to consider putting "safety valves" in their code that users have to go out of their way to open. That way you're doing your part to make your software play nice with the rest of the children, and you can point a clear finger at the user for disabling the safety feature.
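
One way such a "safety valve" could look, sketched in Python rather than as any real module's API: a client that enforces a minimum delay between queries unless the user calls a deliberately awkward method to turn it off. The class name, the method names, and the 5-second default are assumptions made up for this example.

```python
import time


class PoliteClient:
    """Hypothetical search client that throttles itself by default."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between queries (assumed default)
        self._clock = clock               # injectable so the delay can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous query, if any

    def i_accept_responsibility_for_disabling_throttle(self):
        """The valve: the user has to go out of their way to open it."""
        self.min_interval = 0.0

    def wait_turn(self):
        """Sleep until min_interval has elapsed; return the seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

A real client would call `wait_turn()` just before each request. The long method name is the point: nobody disables the throttle by accident.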

        I'm reminded of some code a buddy showed me a few years ago. There was YA Buffer Overflow hole in some software, and the person who found the hole had released a C program to exploit it (in the spirit of SATAN). If you compiled the code (or got a binary from someone) and used it without looking at the source to understand what it did, you would never notice the #ifdef SCRIPT_KIDDIE block that put the user's name, email, IP, hostname, and a bunch of other really useful information into the large string that was generated to overrun the buffer -- giving anyone who had patched the bug all the data they needed to track down the person trying to hack them.

        Perhaps people writing modules in the WWW::Search hierarchy could put similar data into X- headers without documenting the "feature" so search engines can better block/track assholes abusing the module.

        • Perhaps people writing modules in the WWW::Search hierarchy could put similar data into X- headers without documenting the "feature" so search engines can better block/track assholes abusing the module.

          Even outside of the context of this discussion, this is a fabulous idea. Not so much to enable search engines to block abusers, but because software that uses the Net should be self-identifying, especially software that iteratively traverses a site.
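
A rough illustration of self-identifying software, sketched in Python: build the request with a User-Agent that names the tool, its version, and a contact address. The tool name, version, and contact URL are placeholders invented for this example, and the request is only constructed here, never sent.

```python
import urllib.request


def build_request(url, tool_name, tool_version, contact):
    """Build a request whose User-Agent says who is asking and how to reach them."""
    user_agent = f"{tool_name}/{tool_version} (+{contact})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # All identifying values below are hypothetical.
    req = build_request(
        "http://example.com/search?q=perl",
        "ExampleSearchTool",
        "0.1",
        "http://example.com/bot-info",
    )
    # urllib stores header names in capitalized form, hence "User-agent".
    print(req.get_header("User-agent"))
```

With a header like this, a site operator who sees a burst of traffic at least knows which tool it came from and whom to complain to.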


          • We don't need no stinkin' X- headers. That's what User-Agent is for.
            • Most of the CPAN modules I've seen that act as HTTP clients set the User-Agent, but they also have a documented method for the user to override it (in case they need to masquerade as a particular User-Agent).

              I'm suggesting some headers that would be completely undocumented, and could only be overridden using an undocumented method. Most people would be completely unaffected (since the extra X headers would be ignored) and anyone who was affected wouldn't have too much trouble looking at the source to figu

  • Hypocrisy. (Score:2, Insightful)

    Let me make sure I understood this correctly. Google doesn't let a program query their website, retrieve the search results, parse them, and use them, as a metasearch script does when it extracts information from multiple search engines and combines it. They supposedly don't allow people to do that.
    Let's rephrase that: a remote program (a web browser, in a sense) visits their webpage, parses the data to keep only the URLs of webpages that Google obviously doesn't own, and only uses that information
    • Hmm, let me make sure *I* understand this correctly. You are proposing that it is OK to steal their results (i.e., cost them money) without *any* compensation? How exactly do you expect them to stay in business? Their business model is simple: either you pay per query (like Yahoo), or it's free and you "pay" (in the aggregate) by viewing/clicking on ads. Automated queries without pay don't make any sense economically.

      Re your comment on ownership of the information, you're missing the point too--Google does

  • Just out of curiosity, what was it that was removed, and how much can we know about what it was that was deemed DoS-y about it? It would make it easier to assign an appropriate karmic burden to the Googlemeisters if I knew what they objected to.
  • Curious as to what kind of precedent Google is encouraging, I have posted my own terms of use.

    The terms I stipulate are that anyone may access my homepage for any reason at any time by any means, automated or not, with the exception of Google or any of its agents. Google must pay a fee of $1 per hit for any access to any web page or graphic element on my domain, whether that access is through a browser, a spider, or a robot.

    I'm looking forward to my first check!

    -- My choice of computing platform is a symbol of my individuality and belief in personal freedom.
    • The problem is that Google has more and better lawyers and they are not afraid to use them :)
    • by hfb (74) on 2002.03.03 11:41 (#5375)

      Grow up

      Search engines are quite possibly the single most useful part of the internet. Try to spend a day without them. Google provides the service for free and largely advertisement-free as well. Unless you would prefer that Google and others resort to subscription-only or ad-filled content, it is not an unreasonable request to consider the consequences and potential misuses of these modules.

      Google also respects the ban in robots.txt if you don't wish them to index your site. A lot of other crawlers don't.


      • A module that simply goes to Google, does a search, and displays or parses the result is not violating their terms of service AFAICT, though it could be used to do so.

        Of course, their terms of service are very unclear here. What is "automated searching"?

        Anyway, what's scary to me is that a corporation is telling people what kind of code they can and cannot write and distribute.

        But of course, without actually knowing what the removed module did I can't really say too much, though I am sympathetic to peo
        • I don't know any details other than the author voluntarily agreed to remove it from CPAN. I found it interesting and sympathise with Google as abusive crawlers are difficult to identify and stop as well as consume far more resources than is 'fair use'. I don't know that removing the module from distribution will ease or solve the problem but I certainly can't blame them for trying.

          Again, the point isn't whether or not Google is a big bad old mean corporation for picking on a particular module...the point

        • I worked for a search engine for a few years. There were days when 30% of our traffic was meta-search engines, automated placement checkers, and toolbar search things. None of which displayed the banner ads we were putting up. That's a lot of traffic, with a lot of cost associated with it, with no revenue from it at all.

          Like it or not, advertising pays some of the bills, and automated search tools skip the advertising. And, while the ads don't pay for all the traffic (I'd wager a lot that Google's public

  • This may be related to Googlewhacking.

    Disclaimer: the term "Googlewhack" was invented by my employer, Gary Stock (his home page is where it all started). At one point he was getting hundreds of emails a day, people offering up their clever googlewhacks, arguing about scoring and legitimacy, describing programming hacks, etc. The scariest was somebody who said he was coding up a script to walk through /usr/share/dict/words (or whatever) and submit all the word pairs to Google ... yipe!
    Jeff Boes Hyper-real techno priest of Perl
  • How about someone makes a module with Google's cooperation, using some specific interface? That way, Google could monitor how many queries are run using the engine. We'll stay in line, and Google will be protected from leeches making Google proxies (they will be able to spot the popular ones when looking through their log files).
  • Proxies? (Score:2, Interesting)

    According to Google:
    "You may not take the results from a Google search and reformat and display them,"
    I do all of my browsing through an ad-removing proxy. I suppose they could claim my proxy is also in violation of the TOS, since my software reformats every page that I see.

    Sad, really.