All the Perl that's Practical to Extract and Report

use Perl

Tuesday June 06, 2006
12:31 PM

The folly of counting search hits

[ #29827 ]

Measuring anything by web search hits is worse than lying with statistics. There are three major things wrong with taking a number of search hits and attaching meaning to it.

  1. The search terms mean a great deal
  2. There's no way to know what a hit means without looking at it
  3. Web pages can express many different opinions, all of which are included in the overall number of hits

There's no way to know what a search hit actually means without inspecting it. I'm not talking specifically about the term, although that has problems too. I'll assume that when a search engine says it found a certain number of hits, it means unique hits that aren't the same thing repeated in multiple places. Additionally, I'll assume that each hit is a genuine resource and not a page designed to draw attention to affiliate ads and other revenue generators. This is an extremely generous position.

If I simply search for "Perl", I get a certain number of hits. Different search
engines give different numbers of hits. I ran these searches on June 4.

    Google    367,000,000
    Yahoo      62,100,000
    MSN        19,291,407

The discrepancy can have several causes, including:

  1. Some search engines index too little
  2. Some search engines index too much
  3. Search engine counts don't mean anything
  4. Only Google gets it right

No matter how I explain the difference, how do I know that any failing in one search engine isn't found in another one? What supports the reliability of any of these numbers, in any case, other than simply believing them without questioning them? The numbers might be reliable, but how do I reproduce the result? The major search engines don't support each other, and I have no way to verify the total number of hits.
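
To put the disagreement between engines into numbers, here is a minimal Python sketch that divides each reported count for "Perl" by the smallest one. The counts are copied from the table above; the names `perl_hits` and `spread` are made up for this illustration and say nothing about how any engine counts pages.

```python
# Hit counts for the query "Perl" as reported in the post (June 4, 2006).
perl_hits = {
    "Google": 367_000_000,
    "Yahoo": 62_100_000,
    "MSN": 19_291_407,
}


def spread(counts):
    """Return how many times larger each count is than the smallest one."""
    smallest = min(counts.values())
    return {engine: hits / smallest for engine, hits in counts.items()}


ratios = spread(perl_hits)
for engine, ratio in ratios.items():
    print(f"{engine:<8}{ratio:6.1f}x the smallest count")
```

The largest count is roughly nineteen times the smallest for the same one-word query. The arithmetic can't say which engine, if any, is right. It only shows how far apart they are.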

What do those hits really mean, though? I'll assume that the hits are a reliable number, which is a generous concession that still won't help. Consider some other Google searches that should be subsets of the above Google result.

    "Perl sucks"            984
    "Perl rocks"          2,560
    "I hate Perl"        10,700
    "I love Perl"        17,200
    Perl write-only     117,000
    perl "eric raymond" 163,000

Simply searching for "Perl" doesn't tell me anything about Perl because I don't get any context. Are those hits from people praising Perl or deriding Perl? Those 163,000 hits where "perl" shows up with "eric raymond" probably don't make Perl look very good. That's probably the same with "perl" and "write-only". If I want to show how good Perl is, I have to remove all the hits that say it isn't good. Does anyone have time to look through 367 million pages to figure that out? Maybe 366 million of those pages express an extreme distaste for Perl.
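
As a rough check on how little those narrower searches explain, this Python sketch computes each query's share of the 367,000,000 Google hits for "Perl". It assumes, as the post does, that the counts can be taken at face value and that each narrower query is a subset of the broad one. The variable names are hypothetical. The queries may overlap, so the combined figure is an upper bound.

```python
# Google counts copied from the post; treated as exact for the sake of argument.
TOTAL_PERL_HITS = 367_000_000

phrase_hits = {
    '"Perl sucks"': 984,
    '"Perl rocks"': 2_560,
    '"I hate Perl"': 10_700,
    '"I love Perl"': 17_200,
    "Perl write-only": 117_000,
    'perl "eric raymond"': 163_000,
}

# Share of the broad "Perl" result that each narrower query accounts for.
shares = {query: hits / TOTAL_PERL_HITS for query, hits in phrase_hits.items()}

# Upper bound on the share covered by all six queries together
# (an upper bound because the same page may match several queries).
combined_share = sum(phrase_hits.values()) / TOTAL_PERL_HITS

for query, share in shares.items():
    print(f"{query:<22}{share:.4%}")
print(f"{'combined (at most)':<22}{combined_share:.4%}")
```

Even added together, the six queries cover less than a tenth of a percent of the total. Whatever the rest of those pages say about Perl, the hit count alone doesn't reveal it.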

Let's take this a bit further. Are the Beatles really more popular than Jesus? Not according to Google. Furthermore, the Rutles are virtually insignificant to Rod, so we can set that straight after all these years.

    beatles           83,100,000
    jesus            221,000,000
    rutles               438,000
    rod              142,000,000

Now, to really hit you over the head with the hammer, I google some names from World War II. Does anyone want to make the claim that Hitler was more popular than Eisenhower or Patton? Or even Stalin and Eisenhower combined? There probably are people who do, and they'd be able to do that if they simply counted hits. However, as I've just shown, I need context. I can't assume that every hit is a hit in favor of the topic, or is even about the topic, since I don't think Field Marshal Montgomery was as popular as these numbers show. I can't even assume that a hit expresses any opinion at all.

    hitler          59,600,000
    stalin          25,700,000
    roosevelt       81,800,000
    churchill       61,900,000
    eisenhower      31,100,000
    montgomery     133,000,000
    patton          31,200,000
    mussolini        9,420,000
    hirohito         1,040,000

I'll add a bit of context by adding "world war II" as a search constraint. Previously, Churchill had about the same number of hits as Hitler, but under this constraint Churchill has about a third as many. Montgomery is cut down to size by the added context and turns up with about as many hits as Eisenhower. Hitler rises to the top of the pack though. Remember, Hitler didn't survive the war (neither did Roosevelt or Mussolini, and Patton barely survived it and died before he could make it back to the States).

    hitler          11,900,000
    stalin           5,590,000
    roosevelt        8,660,000
    churchill        4,150,000
    eisenhower       3,480,000
    montgomery       3,350,000
    patton           1,230,000
    mussolini          795,000
    hirohito           180,000
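
The effect of the added constraint can be worked out from the two tables. The Python sketch below is only an illustration: it uses the counts as reported, and the names `before`, `after`, and `retained` are made up here. It computes the fraction of each name's hits that survives the "world war II" constraint and shows how the ranking changes.

```python
# Google counts from the post: bare name, then name plus "world war II".
before = {
    "hitler": 59_600_000, "stalin": 25_700_000, "roosevelt": 81_800_000,
    "churchill": 61_900_000, "eisenhower": 31_100_000,
    "montgomery": 133_000_000, "patton": 31_200_000,
    "mussolini": 9_420_000, "hirohito": 1_040_000,
}
after = {
    "hitler": 11_900_000, "stalin": 5_590_000, "roosevelt": 8_660_000,
    "churchill": 4_150_000, "eisenhower": 3_480_000,
    "montgomery": 3_350_000, "patton": 1_230_000,
    "mussolini": 795_000, "hirohito": 180_000,
}

# Fraction of each name's hits that remains once the constraint is added.
retained = {name: after[name] / before[name] for name in before}

# Names ordered from most to fewest hits, with and without the constraint.
ranking_before = sorted(before, key=before.get, reverse=True)
ranking_after = sorted(after, key=after.get, reverse=True)

for name in ranking_after:
    print(f"{name:<12}{retained[name]:6.1%} of hits retained")
print("top before:", ranking_before[0], "| top after:", ranking_after[0])
```

Montgomery keeps the smallest fraction of his hits, which fits the suspicion that most of the bare "montgomery" hits aren't about the Field Marshal at all. The order changes with the query, which is one more reason not to read the raw counts as a measure of popularity.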

If web search results really show the popularity of something, it should work for any topic. Does anyone want to claim it works here?

Maybe you don't like that because you don't like using Hitler as an example. I'll use the current president, George W. Bush. Does anyone want to claim that every one of these hits supports the president? Not only that, next week there will most likely be even more hits, and even more the week after. Does anyone want to claim Bush is getting more popular?

    "George W. Bush"    157,000,000
    "president bush"    217,000,000

To interpret these numbers as anything other than the number of pages that Google indexed under that search term is dishonest. I can't tell how many of those hits have a particular opinion, so I can't use the number of hits to support any opinion.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Have you seen Google Fight [googlefight.com]?

    :-)

    • Very cool...

      I noticed that even those numbers don't agree with the numbers I got from Google. I don't dispute that Java probably has more hits, but the absolute numbers seem to be magical.

      I also recall when I was programming Java that I needed a stack of books about waist high (no kidding). Perl's stack maybe got to my ankle, although back then, there was only the Camel and the Llama. :)
  • I would say that Hitler is indeed the most "popular" in the sense of "well known, frequently encountered" (frequently talked about), while not in the sense of "commonly liked or approved". As the saying goes, there is no bad publicity.

    However, I still agree with the general point that there are caveats with over-interpreting search results.