I spent yesterday in blissful play, at least at first, running a Bayesian spam filter against a big backlog of mail. I'm getting no false positives on good mail, and ~5% false negatives on messages in my spam file. Exploring this, I first found that there were some messages in the spam pile that shouldn't have been there. Oops. Fortunately, none were job offers.
The other false negatives were more interesting. One class of them is very short spams. Here, I suspect that Graham's algorithm needs to be tuned to use fewer than 15 weights for short messages. Gotta play with this one.
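For context, Graham's scheme picks the 15 "most interesting" tokens (those whose spam probability deviates most from 0.5) and combines them with a naive-Bayes product. A minimal sketch of that selection and combination step, with the weight count exposed as a parameter so it can be shrunk for short messages:

```python
from functools import reduce

def combine(probs):
    """Graham's combination rule: naive-Bayes combination of the
    per-token spam probabilities into one message-level probability."""
    p = reduce(lambda a, b: a * b, probs)
    q = reduce(lambda a, b: a * b, (1.0 - x for x in probs))
    return p / (p + q)

def classify(token_probs, n=15):
    """Keep the n tokens whose probabilities deviate most from the
    neutral 0.5 (the 'most interesting' ones), then combine them.
    For a short message, n could be capped at the token count, or
    lower -- that's the tuning experiment suggested above."""
    interesting = sorted(token_probs,
                         key=lambda p: abs(p - 0.5),
                         reverse=True)[:n]
    return combine(interesting)
```

The `n=15` default matches Graham's published figure; the hypothesis here is that a smaller `n` stops a handful of neutral tokens from drowning out the few strong ones in a two-line spam.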
The other class of false negatives leads me to suspect that Graham simplified a few things for purposes of writing a compelling article. His examples show that he's canonicalized tokens to lower case. My first guess was that he'd done this to save space, and I dutifully transliterated. But this misses obvious things like "25 MILLION EMAILS". From my sampling, "25" carries a weight of 0.3935, "million" weighs 0.2794 and "emails" weighs 0.7363. Assuming these are even selected by the algorithm, they tilt toward non-spam. The obvious thing to do is retain case, but this blows the data way up, and scanning a 1MB weighting table for each email has got to suck. I'm going to try converting to lower case if and only if there's already at least one lower-case character in the token.
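That compromise rule is simple to state in code. A sketch (function name is mine):

```python
def canonicalize(token):
    """Lowercase a token only if it already contains at least one
    lowercase character. Mixed-case tokens like 'Million' fold to
    'million', but all-caps screamers like 'MILLION' stay distinct,
    so their (presumably spammier) weights accumulate separately."""
    if any(c.islower() for c in token):
        return token.lower()
    return token
```

Digits and punctuation pass through unchanged, so "25" stays "25" either way; the rule only splits the table where case actually carries signal, which should keep the size blowup far below full case retention.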
However, a couple of folks have requested code, so next up is to get this working with procmail. Then a thousand experiments can bloom.
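The usual way to hang a filter off procmail is as a filtering recipe: pipe the message through a script that stamps a header, then file on that header. A sketch of the header-stamping half (script name, header name, and threshold are all my own choices, not settled interfaces):

```python
def add_spam_header(message, score, threshold=0.9):
    """Insert an X-Spam-Status header after the existing headers of an
    RFC 822 message so a procmail recipe can file on it. 'score' is the
    combined spam probability from the classifier."""
    verdict = "Yes" if score > threshold else "No"
    header = "X-Spam-Status: %s, score=%.4f" % (verdict, score)
    # Headers end at the first blank line; keep the body untouched.
    head, sep, body = message.partition("\n\n")
    return head + "\n" + header + sep + body

# A matching procmailrc fragment might look like (sketch):
#   :0fw
#   | bayes-filter.py
#   :0:
#   * ^X-Spam-Status: Yes
#   spam
```

Under procmail the script would read the message on stdin and write the annotated copy to stdout; the `f` flag marks the recipe as a filter rather than a final delivery.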
Fun. Much fun.