All the Perl that's Practical to Extract and Report

  • by ziggy (25) on 2002.08.16 13:03 (#11856) Journal
    The ever-sensible Paul Graham has a new article [paulgraham.com] on accurately filtering out spam using Bayesian probabilities. He claims to miss only 5 out of every 1000 spams, with zero false positives. The drawback of his technique is that it has to be trained on the kinds of messages and spam you receive. The benefit is that the probability model ends up finely tuned to your own mail.
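    To make the training requirement concrete, here is a minimal sketch of a Bayesian spam score in Python. It is not the formula from the article: it is a plain naive Bayes with add-one smoothing, and the names `tokenize`, `train`, and `spam_probability` are hypothetical helpers invented for this illustration. It also assumes tokens are independent and only looks at tokens that appear in the message.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase runs of letters, digits, $, ' and - count as tokens.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(messages):
    """Count how many messages in a corpus contain each token."""
    counts = Counter()
    for msg in messages:
        counts.update(set(tokenize(msg)))
    return counts


def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | text) from per-token counts in both corpora."""
    # Start from the prior odds implied by the corpus sizes.
    log_odds = math.log(n_spam / n_ham)
    for tok in set(tokenize(text)):
        # Add-one smoothing so unseen tokens never produce a zero probability.
        p_tok_spam = (spam_counts[tok] + 1) / (n_spam + 2)
        p_tok_ham = (ham_counts[tok] + 1) / (n_ham + 2)
        log_odds += math.log(p_tok_spam / p_tok_ham)
    # Clamp before exponentiating to avoid overflow on very long messages.
    log_odds = max(-50.0, min(50.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

    The counts come entirely from the two corpora you feed to `train`, which is why the same code scores mail differently for every user: with no training data, every token has the same smoothed probability in both classes and the score stays at the prior.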

    There's a lot of analysis about spam in the article, including a few well-reasoned explanations of why spam exists and why spam is bad:

    All along the spectrum, if you restrict the sales pitches spammers can make, you will inevitably tend to put them out of business. That word business is an important one to remember. The spammers are businessmen. They send spam because it works. It works because although the response rate is abominably low (maybe 15 per million, vs 3000 per million for a catalog mailing), the cost, to them, is practically nothing. The cost is enormous for the recipients, about 5 man-weeks for each million recipients who spend a second to delete the spam, but the spammer doesn't have to pay that. Even so, sending spam does cost the spammer something, so the lower we can get the response rate, the fewer businesses will find it worth their while to send spam.
    • Bayes is very good if you can tune it to your type of email. By the looks of things, Paul Graham gets very few business-like emails. That's where we found our largest set of false positives with it. He's also right that running Bayes against word pairs works better than against single words, but your database does grow a lot larger.

      We're getting about 90% accuracy with it - on real customer emails.
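      The database growth from word pairs can be seen in a small Python sketch. This is only an illustration under assumed conventions: `tokenize`, `word_pairs`, and `vocabulary_sizes` are hypothetical helpers, and a "pair" here means two adjacent tokens in a message.

```python
import re


def tokenize(text):
    # Assumption: lowercase runs of letters, digits, $, ' and - count as tokens.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def word_pairs(tokens):
    """Return each pair of adjacent tokens joined by a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def vocabulary_sizes(messages):
    """Return (distinct single words, distinct word pairs) across messages."""
    singles, pairs = set(), set()
    for msg in messages:
        toks = tokenize(msg)
        singles.update(toks)
        pairs.update(word_pairs(toks))
    return len(singles), len(pairs)
```

      The same few words can be arranged into many different adjacent pairs, so the pair table usually has more keys than the single-word table, and each key is seen less often, which means it needs more training mail to get reliable counts.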