
dws
http://www.davewsmith.com/

Journal of dws (341)

Friday August 16, 2002
11:28 PM

Viagra, meet Bayes

[ #7140 ]

Paul Graham has written a wonderful article, A Plan for Spam, about using Bayesian techniques to detect spam. The idea is to cull word frequencies from collections of known spam and non-spam, calculate the likelihood that a given word comes from spam, and then use a Bayesian algorithm to compute a spam-likelihood score for an incoming email. Graham claims that the algorithm is resilient to the kind of arms race we're seeing now as spammers try to outwit SpamAssassin.
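
For readers who want to see the scoring step concretely, here is a minimal sketch, written in Python rather than Perl purely for illustration. It is not the code from the article or from this journal. It assumes a precomputed table of per-word spam probabilities (the hypothetical word_probs argument), and it follows the article's description in treating unknown words as 0.4 and combining only the 15 probabilities farthest from 0.5. The tokenizer and the function names are simplifying assumptions.

```python
import re
from math import prod  # Python 3.8+


def tokenize(text):
    # Assumption: lowercase tokens of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def combine(probs):
    # Naive Bayesian combination of independent per-word probabilities:
    #   P = prod(p) / (prod(p) + prod(1 - p))
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def score(text, word_probs, unknown=0.4, top_n=15):
    # Look up each distinct token; unseen words get a mildly innocent 0.4.
    probs = [word_probs.get(tok, unknown) for tok in set(tokenize(text))]
    # Keep the most "interesting" tokens: those farthest from a neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:top_n])
```

A message would then be treated as spam when score() exceeds some threshold; the article suggests 0.9. Because only the strongest evidence is combined, a handful of very spammy or very innocent words decides the outcome.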

Graham's few example snippets are in Lisp. So far, the transliteration to Perl has been straightforward. Judging by some Q&A today at PerlMonks, others have the same idea. After a few hours of work, I've run word frequency counts on a large pile of spam (I knew there was a reason to save it) and on a large pile of non-spam, and have cranked out a table giving, for each word, the probability that it comes from spam. Now I'm running the algorithm against the saved spam, looking for false negatives. Next up is to run the algorithm against good mail, looking for false positives.
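
The workflow above (count words in each pile, derive a probability table, then check both piles for misclassifications) can be sketched roughly as follows. This is an illustrative Python sketch, not the Perl described here. The doubling of good-mail counts, the minimum of five occurrences, and the clamping to the range 0.01 to 0.99 follow the article's description as I understand it; the function names, the tokenizer, and the idea of passing each pile as a list of message strings are assumptions.

```python
import re
from collections import Counter
from math import prod  # Python 3.8+


def tokenize(text):
    # Assumption: lowercase tokens of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def count_words(messages):
    # Total occurrences of each token across a pile of messages.
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def build_table(spam_msgs, good_msgs, min_count=5):
    bad, good = count_words(spam_msgs), count_words(good_msgs)
    nbad, ngood = max(len(spam_msgs), 1), max(len(good_msgs), 1)
    table = {}
    for word in set(bad) | set(good):
        g = 2 * good[word]  # good counts are doubled to bias against false positives
        b = bad[word]
        if g + b < min_count:
            continue  # too rare to say anything useful
        bad_freq = min(1.0, b / nbad)
        good_freq = min(1.0, g / ngood)
        p = bad_freq / (bad_freq + good_freq)
        table[word] = max(0.01, min(0.99, p))  # never fully certain
    return table


def score(text, table, unknown=0.4, top_n=15):
    # Combine the top_n probabilities farthest from 0.5.
    probs = sorted((table.get(t, unknown) for t in set(tokenize(text))),
                   key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def error_counts(table, spam_msgs, good_msgs, threshold=0.9):
    # False negatives: spam that slips under the threshold.
    false_neg = sum(1 for m in spam_msgs if score(m, table) <= threshold)
    # False positives: good mail that gets flagged.
    false_pos = sum(1 for m in good_msgs if score(m, table) > threshold)
    return false_neg, false_pos
```

Checking the same piles that built the table gives an optimistic picture, so holding some mail back for the check would give a more honest count.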

Then comes the decision on how to hook things up. It should be easy to shoehorn this into my existing procmail script, or perhaps Mail::Audit is a better platform. Decisions, fun decisions...
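
One way the hookup might look, whichever platform wins: a small filter that takes a raw message, adds score headers, and hands the message back, so that a later delivery rule can file it based on the header. The Python sketch below is only an illustration of that shape. The header names X-Bayes-Score and X-Bayes-Spam, the tag_message function, and the score_fn parameter are all hypothetical, and the scorer itself is left as something the caller supplies.

```python
import sys
from email import message_from_string


def tag_message(raw, score_fn, threshold=0.9):
    # Score the whole raw text, headers included, then record the result
    # in two (hypothetical) headers that a delivery rule could match on.
    msg = message_from_string(raw)
    spam_score = score_fn(raw)
    msg["X-Bayes-Score"] = f"{spam_score:.4f}"
    msg["X-Bayes-Spam"] = "YES" if spam_score > threshold else "NO"
    return msg.as_string()


def main(score_fn):
    # Filter mode: read one message on stdin, write the tagged copy to stdout.
    sys.stdout.write(tag_message(sys.stdin.read(), score_fn))
```

Keeping the scoring in a filter and the filing decision in the mail rules means the threshold and the destination folder can change without touching the scorer.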
