When looking at the annoyance factor of unwanted mail, bounce messages (caused by some insane worms randomly sending mails with arbitrary from-addresses) seem to have overhauled ordinary spam. The problem with those is that they pass my spam filters and now I have to take steps.
I figure that it should be possible to get an almost flawless detection of those bounces with a specially tailored bayesian filter. Note that I don't want to use the existing bayes filter (as part of SpamAssassin for example). I would first have to train them and also, I suspect that real spam and bounces don't have much in common when looking at the used words.
So what I have started doing now is writing a bayesian filter for bounces. First thing I wrote was a flex-scanner that detects valid RFC822 mail addresses. The scanner gets fed one message. It opens a pipe to another process (the one that does the actual filtering) and writes the mail to this process. The only thing the scanner does is replacing every email-address it can find in the body with
T_MAILADDR or somesuch. When reading RFC822 correctly, the below should be the rules for a valid email-address:
This should be a huge advantage for a bayesian filter since now not every single email-address is a word for its own but rather they get mapped onto one word.
The idea behind that is of course, that bounce messages tend to have a lot of email addresses in their body. Some of them even include whole header fields, so I could extend the scanner to detect those and generate another token for them.
For now I'll prototype the program that the scanner opens a pipe to in Perl and see whether the approach makes any sense at all. If it does, I can rewrite it in C and have a fairly well-performing bayesian filter that I can plug into my
.procmailrc before spamassassin is even triggered.