
All the Perl that's Practical to Extract and Report

use Perl

pudge (1)

  (email not shown publicly)
http://pudge.net/
AOL IM: Crimethnk

I run this joint, see?

Journal of pudge (1)

Saturday February 23, 2002
03:51 PM

obviously

[ #3075 ]

TorgoX noted that one of his journal entries was not showing up in the fulltext journal search for the term "obviously". Hm. After some searching by Krow, we discovered that MySQL 3.x has a stopword list compiled in if COMPILE_STOPWORDS_IN is defined, and "obviously" is a stopword. Yow. We can edit the list, undef COMPILE_STOPWORDS_IN, or not use that word anymore!
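For anyone wondering why a stopword makes an entry unfindable, here is a minimal sketch of the general idea, not MySQL's actual code. The four-word STOPWORDS set is a tiny excerpt using words mentioned on this page, and build_index and search are hypothetical names made up for illustration. The point is that a stopword is dropped when the index is built, so no later search for it can match anything.

```python
import re

# Assumption: a tiny illustrative excerpt, not the real compiled-in list.
STOPWORDS = {"obviously", "unfortunately", "willing", "afterwards"}

def build_index(docs, stopwords=STOPWORDS):
    """Map each non-stopword term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in stopwords:
                continue  # stopwords are skipped, so they never reach the index
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, term):
    """Return the sorted ids of documents indexed under the given term."""
    return sorted(index.get(term.lower(), set()))

docs = {
    1: "Obviously this journal should be searchable",
    2: "Another journal entry",
}
index = build_index(docs)
print(search(index, "obviously"))  # []
print(search(index, "journal"))    # [1, 2]
```

Passing an empty set as stopwords when building the index is the sketch's equivalent of undefining COMPILE_STOPWORDS_IN: "obviously" then gets indexed like any other word.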

  • 570 stop words!? That's insane. I thought it was odd when some other search software I was using had "furthermore" as a stop word. Aren't stop words supposed to be common enough to be considered noise?

    Other words that MySQL thinks we'll never need to search for include "unfortunately", "willing", "afterwards" (but not "afterward"?), "corresponding", "associated", "known", "second", "unlikely", "better", and "immediate". It seems like a pretty random selection.
    • I think it's quite overdone too, but OTOH, these words don't really tell very much at all about the content of a given bit of text, in the general sense. Were I looking for TorgoX's journal entry in question, the chances of me looking for it by searching for "obviously" are pretty slight. The only journal entry I might think of looking for by that word is this one here. :-D
      • That's true, but if someone is searching for "obviously" then presumably they have a reason for it, and it doesn't seem like you'd save that much room in your indexes by leaving it out.

        The specific problem that caused me to delve into the list of stop words in the other search software was that people were unable to search for the cigarette brand More (this was a database of tobacco company documents released as a result of court orders) because "more" was a stop word.

        I guess that's a good reason not to n
        • That is an interesting point I've thought of in the past, but not in terms of stopwords; what kinds of brands/labels/names did people pick previously that they would no longer pick because of no available domain name or weak search results?
          • I've always been annoyed that Microsoft doesn't choose searchable names for their products. It's much easier to search for WordPerfect than for Word (or Access or Excel or Windows or ...). I assume searchability is why Allaire changed the name Cold Fusion to ColdFusion.

            By the way, the username and the word "Journal" (but not, for some reason, the "'s") in the heading have been white in light mode for a few weeks, which makes them invisible with normal settings. Probably something that got changed around
            • No plans to list journals by # of comments. The light mode problem should be fixed at the next update (committed to CVS a few days ago, IIRC).
      • My father-in-law has been noting how frequently the word "obviously" is said during interviews. Once he pointed it out, it was, obviously, omnipresent.