When it comes to the annoyance factor of unwanted mail, bounce messages (caused by some insane worms randomly sending mail with arbitrary From addresses) seem to have overtaken ordinary spam. The problem is that they pass my spam filters, so now I have to take steps.
I figure that it should be possible to get almost flawless detection of those bounces with a specially tailored Bayesian filter. Note that I don't want to use an existing Bayes filter (the one that is part of SpamAssassin, for example). I would first have to train it, and I also suspect that real spam and bounces don't have much in common in terms of the words they use.
So what I have started doing now is writing a Bayesian filter for bounces. The first thing I wrote was a flex scanner that detects valid RFC 822 mail addresses. The scanner gets fed one message. It opens a pipe to another process (the one that does the actual filtering) and writes the mail to this process. The only thing the scanner does is replace every email address it finds in the body with T_MAILADDR or somesuch. If I read RFC 822 correctly, the following should be the rules for a valid email address:
atom [!#$%&'*+/0-9=?A-Z^_`a-z{|}~-]+
dtext [\x00-\x0C\x0E-\x5A\x5E-\x7F]*
qtext [\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]*
quoted_pair "\\"[\x00-\x7F]
quoted_string "\""({qtext}|{quoted_pair})*"\""
word {atom}|{quoted_string}
domain_literal "["({dtext}|{quoted_pair})*"]"
domain_ref {atom}
sub_domain {domain_ref}|{domain_literal}
domain {sub_domain}("."{sub_domain})*
local_part {word}("."{word})*
addr_spec {local_part}"@"{domain}
This should be a huge advantage for a Bayesian filter, since not every single email address becomes a word of its own; instead they all get mapped onto one word.
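To illustrate the mapping, here is a minimal sketch in Python rather than flex. It assumes a simplified pattern that only covers dot-separated atoms on both sides of the "@" (no quoted strings, no domain literals), so it is an approximation of the rules above and not a replacement for the scanner. The names `ATOM`, `ADDR_SPEC` and `replace_addresses` are made up for this example.

```python
import re

# Simplified approximation of an RFC 822 addr-spec: dot-separated atoms on
# both sides of "@". Quoted strings and domain literals are deliberately
# left out of this sketch.
ATOM = r"[!#$%&'*+/0-9=?A-Z^_`a-z{|}~-]+"
ADDR_SPEC = re.compile(rf"{ATOM}(?:\.{ATOM})*@{ATOM}(?:\.{ATOM})*")


def replace_addresses(body: str, token: str = "T_MAILADDR") -> str:
    """Replace every address-like substring in body with a single token."""
    return ADDR_SPEC.sub(token, body)


if __name__ == "__main__":
    sample = "Delivery to <bob@example.org> failed; original sender alice@example.com."
    print(replace_addresses(sample))
```

After this step the filter sees the same word for every address, no matter which address the worm happened to forge.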
The idea behind that is, of course, that bounce messages tend to have a lot of email addresses in their body. Some of them even include whole header fields, so I could extend the scanner to detect those and generate another token for them.
For now I'll prototype the program that the scanner pipes to in Perl and see whether the approach makes any sense at all. If it does, I can rewrite it in C and have a fairly well-performing Bayesian filter that I can plug into my .procmailrc before SpamAssassin is even triggered.
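The header-field extension could look roughly like the following Python sketch. The list of header names, the `T_HEADER` token and the function name are assumptions made for illustration; a real scanner would probably want a longer list and would have to deal with folded (continued) header lines, which this sketch ignores.

```python
import re

# Hypothetical selection of header names that tend to be quoted in bounce bodies.
KNOWN_HEADERS = ("Received", "Return-Path", "Message-ID", "From", "To", "Subject", "Date")

# A body line counts as a quoted header field if it starts (after optional
# whitespace or ">" quoting) with one of the known names followed by ": ".
HEADER_LINE = re.compile(
    r"^[ \t>]*(?:" + "|".join(map(re.escape, KNOWN_HEADERS)) + r"):[ \t].*$",
    re.IGNORECASE | re.MULTILINE,
)


def replace_header_fields(body: str, token: str = "T_HEADER") -> str:
    """Replace each whole line that looks like a header field with one token."""
    return HEADER_LINE.sub(token, body)
```

Restricting the match to known names keeps ordinary sentences that merely end in a colon from being turned into tokens.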
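As a rough idea of what the filtering process could do, here is a minimal naive Bayes sketch. It is written in Python instead of the planned Perl, and the class name, the two labels and the add-one smoothing are all assumptions of this example, not a description of the actual prototype. It expects text in which addresses have already been replaced by T_MAILADDR.

```python
import math
import re
from collections import Counter


class BounceFilter:
    """Tiny two-class naive Bayes classifier over word tokens (illustrative only)."""

    LABELS = ("bounce", "other")

    def __init__(self):
        self.counts = {label: Counter() for label in self.LABELS}
        self.messages = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        # Lowercased word tokens; T_MAILADDR simply becomes "t_mailaddr".
        return re.findall(r"\w+", text.lower())

    def train(self, text, label):
        self.counts[label].update(self.tokenize(text))
        self.messages[label] += 1

    def bounce_probability(self, text):
        vocab = len(set(self.counts["bounce"]) | set(self.counts["other"]))
        total_msgs = sum(self.messages.values())
        tokens = self.tokenize(text)
        log_score = {}
        for label in self.LABELS:
            total_tokens = sum(self.counts[label].values())
            # Smoothed class prior, then smoothed per-token likelihoods.
            score = math.log((self.messages[label] + 1) / (total_msgs + 2))
            for tok in tokens:
                score += math.log(
                    (self.counts[label][tok] + 1) / (total_tokens + vocab + 1)
                )
            log_score[label] = score
        # Turn the two log scores into P(bounce | text); clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["other"] - log_score["bounce"]))
        return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    f = BounceFilter()
    f.train("T_MAILADDR could not be delivered to T_MAILADDR mailbox unavailable", "bounce")
    f.train("Meeting notes for tomorrow attached", "other")
    print(f.bounce_probability("Message to T_MAILADDR could not be delivered"))
```

Working in log space avoids the underflow that multiplying many small probabilities would cause on long messages.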
Seems Like a Lot of Work (Score:1)
I check the Return-Path header for <>. That catches most of the bounces.
Re:Seems Like a Lot of Work (Score:2)
[...] Return-Path header at all. The attached original message usually has one, but it always contains one of my addresses.
from: ne from_ (Score:2)
But I have still not received a single message on that address that wasn't a bounce that I wanted to read.
The mail address that I use in the headers is juerd@example.com (but with ano
Re:from: ne from_ (Score:2)
Of course there are stupid mail systems o
Re:from: ne from_ (Score:2)