Stories, comments, journals, and other submissions on use Perl; are Copyright 1998-2006, their respective owners.
corpuses (corpi?) don't work that way (Score:2)
The problem with a universal corpus is that one person's spam is another person's ham.
Just to pick a random example, almost any message received by you or me that is written in German is going to be spam. My command of the language doesn't go much deeper than "wo ist der Bierhaus, fraulein?", so anyone sending me an email in German is probably trying to sell me something. So for me, German text is at least a 98% confident indicator of spam.
On the other hand, there are hundreds of millions of German speakers out there who probably have a different opinion on my conclusion. I can hardly blame them.
The whole point of statistical filters like Bayes is that they let individuals come up with statistical descriptions of their own spam/ham corpus. While you could apply some of these techniques to a universal corpus, it wouldn't be anywhere near as accurate or useful as one built around your own mail patterns.
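To make that concrete, here is a rough Python sketch of the per-token estimate a Bayes-style filter builds on. Everything in it is made up for illustration: the two tiny mailboxes, the function names, the add-one smoothing, and the equal spam/ham priors are all assumptions, and a real filter combines many tokens rather than scoring one. The point is only that the same German word comes out spammy for one user and hammy for another.

```python
from collections import Counter

def train(messages):
    """Count, per label, how many messages each token appears in."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = {"spam": 0, "ham": 0}
    for label, text in messages:
        totals[label] += 1
        # Use a set so a token counts once per message.
        counts[label].update(set(text.lower().split()))
    return counts, totals

def spam_probability(token, model):
    """Estimate P(spam | token) with add-one smoothing and equal priors."""
    counts, totals = model
    p_token_given_spam = (counts["spam"][token] + 1) / (totals["spam"] + 2)
    p_token_given_ham = (counts["ham"][token] + 1) / (totals["ham"] + 2)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)

# Hypothetical mailbox of someone who reads no German.
english_user = [
    ("spam", "kaufen sie jetzt billig"),
    ("spam", "jetzt kaufen viagra"),
    ("ham", "lunch meeting moved to noon"),
    ("ham", "patch for the perl module attached"),
]

# Hypothetical mailbox of a German speaker.
german_user = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap loans buy now"),
    ("ham", "wir kaufen jetzt die tickets"),
    ("ham", "kannst du jetzt anrufen"),
]

english_model = train(english_user)
german_model = train(german_user)

print(spam_probability("jetzt", english_model))  # 0.75
print(spam_probability("jetzt", german_model))   # 0.25
```

Pool those two mailboxes into one "universal" corpus and the token lands at 0.5, which tells neither user anything.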
Moreover, part of the reason that individual statistical filters are effective is that they are heterogeneous: because everyone has a different profile, it's difficult or impossible for spammers to come up with messages that will be misinterpreted as ham for everyone. If you have any kind of standard, widely used spam characteristics -- like the heuristics encoded into SpamAssassin's rules -- then spammers can exploit those properties to "cloak" their messages as ham. Hence spam headers that include Mutt, Pine, Outlook, Mozilla, Eudora, and AOL as the mail agent, because each of those gives a few "ham" points in SpamAssassin; hence, too, revisions to later versions of SpamAssassin to penalize messages that cite multiple mail agents.
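A toy version of that arms race, again in Python. To be clear, these are NOT SpamAssassin's real rules or scores: the weights, the function names, and the substring matching are all invented for illustration. It just shows how a shared "recognized mail agent earns ham points" rule rewards a header that names every agent at once, and how a revised rule can turn that into a penalty.

```python
# Hypothetical weights, not taken from SpamAssassin.
KNOWN_AGENTS = {"mutt", "pine", "outlook", "mozilla", "eudora", "aol"}
HAM_BONUS_PER_AGENT = -0.5   # negative score = looks more like ham
MULTI_AGENT_PENALTY = 3.0    # positive score = looks more like spam

def agents_claimed(header_value):
    """Return the known agent names that appear in the header value."""
    lowered = header_value.lower()
    return {agent for agent in KNOWN_AGENTS if agent in lowered}

def naive_score(header_value):
    """Shared rule: every recognized mail agent earns a ham bonus."""
    return HAM_BONUS_PER_AGENT * len(agents_claimed(header_value))

def revised_score(header_value):
    """Revised rule: one agent earns the bonus, several earn a penalty."""
    n = len(agents_claimed(header_value))
    if n == 0:
        return 0.0
    if n == 1:
        return HAM_BONUS_PER_AGENT
    return MULTI_AGENT_PENALTY

honest = "Mutt/1.4.1i"
cloaked = "Mutt Pine Outlook Mozilla Eudora AOL"

print(naive_score(honest), naive_score(cloaked))      # -0.5 -3.0
print(revised_score(honest), revised_score(cloaked))  # -0.5 3.0
```

Under the naive rule the cloaked header looks six times as trustworthy as the honest one. Once the rule is public and shared, that is exactly the header a spammer will send.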
A universal spam corpus is an appealing idea, and there are research contexts where having one makes sense, but as examples like the mail agent headers show, filtering techniques built around a universal profile would often be easy to defeat. It would be nice if there were easier ways to package individual profiles, but generally speaking they seem to be the way to go.
--
DO NOT LEAVE IT IS NOT REAL.
Re:corpuses (corpi?) don't work that way (Score:2)