All the Perl that's Practical to Extract and Report

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • 570 stop words!? That's insane. I thought it was odd when some other search software I was using had "furthermore" as a stop word. Aren't stop words supposed to be common enough to be considered noise?

    Other words that MySQL thinks we'll never need to search for include "unfortunately", "willing", "afterwards" (but not "afterward"?), "corresponding", "associated", "known", "second", "unlikely", "better", and "immediate". It seems like a pretty random selection.
    • I think it's quite overdone too, but OTOH, these words don't really tell very much at all about the content of a given bit of text, in the general sense. Were I looking for TorgoX's journal entry in question, the chances of me looking for it by searching for "obviously" are pretty slight. The only journal entry I might think of looking for by that word is this one here. :-D
      • by vsergu (505) on 2002.02.23 23:39 (#5017) Journal
        That's true, but if someone is searching for "obviously" then presumably they have a reason for it, and it doesn't seem like you'd save that much room in your indexes by leaving it out.

        The specific problem that caused me to delve into the list of stop words in the other search software was that people were unable to search for the cigarette brand More (this was a database of tobacco company documents released as a result of court orders) because "more" was a stop word.

        I guess that's a good reason not to name your band The The.
        • That is an interesting point I've thought of in the past, but not in terms of stopwords; what kinds of brands/labels/names did people pick previously that they would no longer pick because of no available domain name or weak search results?
          • I've always been annoyed that Microsoft doesn't choose searchable names for their products. It's much easier to search for WordPerfect than for Word (or Access or Excel or Windows or ...). I assume searchability is why Allaire changed the name Cold Fusion to ColdFusion.

            By the way, the username and the word "Journal" (but not, for some reason, the "'s") in the heading have been white in light mode for a few weeks, which makes them invisible with normal settings. Probably something that got changed around
            • No plans to list journals by # of comments. The light mode problem should be fixed at the next update (committed to CVS a few days ago, IIRC).