NOTE: use Perl; is on undef hiatus. You can read content, but you can't post it. More info will be forthcoming forthcomingly.

All the Perl that's Practical to Extract and Report

  • The problem with a universal corpus is that one person's spam is another person's ham.

    Just to pick a random example, almost any message received by you or me that is written in German is going to be spam. My command of the language doesn't go much deeper than "wo ist der Bierhaus, fraulein?" ("where is the beer hall, miss?"), so anyone sending me an email in German is probably trying to sell me something. So for me, German text is at least a 98% confident indicator of spam.

    On the other hand, there are hundreds of millions of German speakers out there who probably have a different opinion on my conclusion. I can hardly blame them.

    The whole point of statistical filters like Bayes is that they let individuals come up with statistical descriptions of their own spam/ham corpus. While you could apply some of these techniques to a universal corpus, it wouldn't be anywhere near as accurate or useful as one built around your own mail patterns.
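    The per-user statistical description above can be sketched as a toy Bayesian filter. This is a simplified illustration, not any real filter's implementation; the class and method names, the Laplace smoothing, and the log-odds combination are all assumptions made for the sketch:

```python
import math
from collections import Counter

class PersonalBayesFilter:
    """Toy per-user Bayesian spam filter (illustrative sketch only).

    Token probabilities come solely from mail the owner has classified,
    which is why "German text means spam" can hold for one user's filter
    and not another's.
    """

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, text, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(text.lower().split())

    def token_spamminess(self, token):
        # Laplace-smoothed P(spam | token); an unseen token scores a
        # neutral 0.5 instead of forcing the estimate to 0 or 1.
        s = self.spam_counts[token]
        h = self.ham_counts[token]
        return (s + 1) / (s + h + 2)

    def is_spam(self, text):
        # Combine per-token probabilities in log-odds space; a positive
        # total means the message looks more like this user's spam.
        score = sum(math.log(p / (1 - p))
                    for p in (self.token_spamminess(t)
                              for t in text.lower().split()))
        return score > 0
```

    Two users training instances of this class on their own mail end up with different spamminess tables for the same tokens, which is exactly the heterogeneity the comment describes.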

    Moreover, part of the reason that individual statistical filters are effective is that they are heterogeneous: because everyone has a different profile, it's difficult or impossible for spammers to come up with messages that will be misinterpreted as ham for everyone. If you have any kind of standard, widely used spam characteristics -- like the heuristics encoded into SpamAssassin's rules -- then spammers can exploit those properties to "cloak" their messages as ham; hence, spam headers that include Mutt, Pine, Outlook, Mozilla, Eudora, and AOL as the mail agent, because each of those gives a few "ham" points in SpamAssassin; hence, revisions to later versions of SpamAssassin to penalize messages that cite multiple mail agents.

    A universal spam corpus is an appealing idea, and there are research contexts where having one makes sense, but as examples like the mail agent headers show, filtering techniques built around a universal profile would often be easy to defeat. It would be nice if there were easier ways to package individual profiles, but generally speaking, they seem to be the way to go.



    • I think that you can set things up so that your own private ruleset is used to actively delete spam, while the public ruleset would be used to flag spam while still passing it on. Then, you can scan the flagged items fairly quickly and use them to update your private ruleset.
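
      The two-tier setup described above can be sketched as a routing function. The function name, score inputs, and thresholds are assumptions for illustration, not any mail system's actual API; in practice the scores would come from tools like the personal Bayesian filter and a shared ruleset such as SpamAssassin's:

```python
def route_message(private_score: float, public_score: float,
                  delete_threshold: float = 0.99,
                  flag_threshold: float = 0.5) -> str:
    """Return 'delete', 'flag', or 'deliver' for one message.

    private_score: spam probability from the user's own trained filter.
    public_score:  spam probability from the shared public ruleset.
    """
    if private_score >= delete_threshold:
        return "delete"   # trusted personal ruleset: drop outright
    if public_score >= flag_threshold:
        return "flag"     # public ruleset: tag it, but still deliver
    return "deliver"
```

      The asymmetry is the point: only the personal, individually tuned score is trusted enough to delete mail, while the public ruleset merely flags messages for the quick review pass that feeds retraining.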