
Mark Leighton Fisher (4252)

http://mark-fisher.home.mindspring.com/

I am a Systems Engineer at Regenstrief Institute [regenstrief.org]. I also own Fisher's Creek Consulting [comcast.net].
Friday May 30, 2008
12:04 PM

PageRank is Precomputed Relevancy Ranking

[ #36553 ]

Google's PageRank is precomputed relevancy ranking, where the heavy lifting of actual relevancy ranking is done by us humans. Why is this important? I was re-reading "A new comparison between conventional indexing (MEDLARS) and automatic text processing (SMART)", which lays out how computerized indexing can beat the best manual indexing by:

  • Using a stop-word list;
  • Using a thesaurus (synonyms); and
  • Relevancy ranking.

(It's more complicated than that, but you get the idea.) Relevancy ranking is the hardest part of the indexing job, as there are no clear-cut algorithms for relevancy ranking with both excellent recall and excellent precision (getting all of the documents you want and none of the documents you don't want). Google's PageRank works around the difficulty of relevancy ranking by handing the hardest part, the ranking of individual documents, to us humans: every link we make to a page counts as a vote for it. You can get good results from proper metadata, but metadata is useful only in environments where no one has an interest in gaming the metadata. (I wonder if it should be called "The Semantic Intranet"? That's where Semantic Web technologies really make sense to me.)
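
To make "precomputed" concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production algorithm; the function name, the 0.85 damping factor, the iteration count, and the toy link graph are all assumptions chosen for illustration. The point is that the scores depend only on who links to whom, so they can be computed before anyone types a query.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative sketch only).

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-page web: three pages "vote" for page c.
    toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

Nothing in the function looks at the text of a page or at a query. All of the relevancy judgment comes from the links that people chose to make.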

The original paper is worth a read, especially if you work on software that incorporates search, and these days I suspect that almost any non-embedded program could grow to a point where it incorporates a search mechanism (and an email client, and a web browser; you get the point).
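
If you do end up adding search to a program, the three ingredients listed above (stop-word list, thesaurus, relevancy ranking) can be sketched in a few lines. This is a rough illustration and not the SMART system's actual method: the tiny stop-word list, the thesaurus entries, the function names, and the simple tf-idf style scoring are all assumptions made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists; a real system would use far larger ones.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is", "for"}
THESAURUS = {"cardiac": "heart", "coronary": "heart", "renal": "kidney"}


def terms(text):
    """Lower-case the text, drop stop words, and map synonyms to one canonical term."""
    words = re.findall(r"[a-z]+", text.lower())
    return [THESAURUS.get(word, word) for word in words if word not in STOP_WORDS]


def rank(query, documents):
    """Return the names of matching documents, best match first.

    documents maps a document name to its text. Scoring is term frequency
    weighted by a smoothed inverse document frequency.
    """
    doc_terms = {name: Counter(terms(text)) for name, text in documents.items()}
    n_docs = len(documents)

    def idf(term):
        # Terms that appear in fewer documents get a higher weight.
        df = sum(1 for counts in doc_terms.values() if term in counts)
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    query_terms = terms(query)
    scores = {}
    for name, counts in doc_terms.items():
        score = sum(counts[term] * idf(term) for term in query_terms)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    docs = {
        "d1": "Cardiac surgery outcomes in the elderly",
        "d2": "Renal failure and dialysis",
        "d3": "Heart disease and heart failure",
    }
    print(rank("coronary disease", docs))
```

The thesaurus is what lets a query for "coronary" find a document that only says "cardiac" or "heart". The ranking step is what decides which of the matches comes first, and that is the part PageRank hands off to the people making links.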

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • ... and decided to more or less farm it out to professionals.

    So at work, we're deploying a product called Endeca, which does all the insane multi-dimensional graph magic.

    Similarly, I noticed DreamHost recommending farming mail out to Google.

    It seems that areas like these, where you have some form of non-trivial problem, are the ideal place to either specialise as much as possible or centralise efforts.