
All the Perl that's Practical to Extract and Report

use Perl

ethan (3163)

ethan
  reversethis-{ed. ... rap.nov.olissat}

Being a 25-year-old chap living in the westernmost town of Germany. Studying communication and information science and being a huge fan of XS-related things.

Journal of ethan (3163)

Saturday May 29, 2004
02:07 AM

Dealing with bounces

[ #18994 ]

When looking at the annoyance factor of unwanted mail, bounce messages (caused by some insane worms randomly sending mail with arbitrary From addresses) seem to have overtaken ordinary spam. The problem is that they pass my spam filters, so now I have to take steps.

I figure that it should be possible to get almost flawless detection of those bounces with a specially tailored Bayesian filter. Note that I don't want to use an existing Bayes filter (the one that is part of SpamAssassin, for example). I would first have to train it, and I also suspect that real spam and bounces don't have much in common in terms of the words they use.

So what I have started doing now is writing a Bayesian filter for bounces. The first thing I wrote was a flex scanner that detects valid RFC 822 mail addresses. The scanner gets fed one message. It opens a pipe to another process (the one that does the actual filtering) and writes the mail to this process. The only thing the scanner does is replace every email address it finds in the body with T_MAILADDR or somesuch. If I am reading RFC 822 correctly, the rules below should describe a valid email address:

    atom            [!#$%&'*+/0-9=?A-Za-z^_`{|}~-]+
    dtext           [\x00-\x0C\x0E-\x5A\x5E-\x7F]*
    qtext           [\x00-\x0C\x0E-\x21\x23-\x5B\x5D-\x7F]*
    quoted_pair     "\\"[\x00-\x7F]
    quoted_string   "\""({qtext}|{quoted_pair})*"\""
    word            {atom}|{quoted_string}
 
    domain_literal  "["({dtext}|{quoted_pair})*"]"
    domain_ref      {atom}
    sub_domain      {domain_ref}|{domain_literal}
    domain          {sub_domain}("."{sub_domain})*
    local_part      {word}("."{word})*
 
    addr_spec       {local_part}"@"{domain}

This should be a huge advantage for a Bayesian filter: instead of every single email address being a word of its own, they all get mapped onto one word.
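The substitution step can be sketched in a few lines. This is only an illustration in Python, not the flex scanner itself: it assumes a simplified address pattern (dot-separated atoms on both sides of the "@", no quoted strings or domain literals), and the function name `replace_addresses` is made up for the example.

```python
import re

# Simplified approximation of addr_spec: dot-separated atoms around "@".
# Quoted strings and domain literals are deliberately left out (assumption).
ATOM = r"[!#$%&'*+/0-9=?A-Za-z^_`{|}~-]+"
ADDR_SPEC = re.compile(rf"{ATOM}(?:\.{ATOM})*@{ATOM}(?:\.{ATOM})*")


def replace_addresses(body: str, token: str = "T_MAILADDR") -> str:
    """Replace every address-like substring in body with a single token."""
    return ADDR_SPEC.sub(token, body)


if __name__ == "__main__":
    print(replace_addresses("Delivery to <bob@example.com> failed."))
```

After this pass, a body that mentions fifty different addresses contributes fifty occurrences of one word to the filter rather than fifty words seen once each.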

The idea behind that is, of course, that bounce messages tend to have a lot of email addresses in their body. Some of them even include whole header fields, so I could extend the scanner to detect those and generate another token for them.
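A rough sketch of that extension, again in Python for illustration only. The token name `T_HEADER`, the function name, and the fixed list of field names are assumptions made for this example; a real scanner would probably need a longer list or a more general rule.

```python
import re

# A quoted header field: a known field name at the start of a line, followed
# by any folded continuation lines (lines starting with a space or tab).
# The list of field names is an arbitrary sample, not a complete set.
HEADER_FIELD = re.compile(
    r"^(?:Received|Return-Path|Message-ID|From|To|Subject|Date):.*(?:\n[ \t]+.*)*",
    re.MULTILINE | re.IGNORECASE,
)


def replace_headers(body: str, token: str = "T_HEADER") -> str:
    """Replace each quoted header field in body with a single token."""
    return HEADER_FIELD.sub(token, body)


if __name__ == "__main__":
    sample = "Your message bounced.\nReceived: from a.example\n\tby b.example\nSubject: hi\nbye"
    print(replace_headers(sample))
```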

For now I'll prototype the program that the scanner opens a pipe to in Perl and see whether the approach makes any sense at all. If it does, I can rewrite it in C and have a fairly well-performing Bayesian filter that I can plug into my .procmailrc before SpamAssassin is even triggered.
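To give an idea of what the filtering side might look like, here is a minimal naive Bayes sketch. It is written in Python rather than Perl purely for illustration, and everything in it (the class name `BounceFilter`, the two labels, the tokenizer, the add-one smoothing) is an assumption of this example rather than the actual prototype.

```python
import math
import re
from collections import Counter

# Word tokenizer; T_MAILADDR survives as one token (lowercased on input).
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_'-]*")


class BounceFilter:
    """Tiny naive Bayes classifier over word tokens (illustrative only)."""

    def __init__(self):
        self.counts = {"bounce": Counter(), "other": Counter()}
        self.messages = {"bounce": 0, "other": 0}

    def train(self, text, label):
        """Count the words of one message under the given label."""
        self.counts[label].update(TOKEN.findall(text.lower()))
        self.messages[label] += 1

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods, with add-one smoothing."""
        counts = self.counts[label]
        total = sum(counts.values())
        vocab = len(set(self.counts["bounce"]) | set(self.counts["other"]))
        score = math.log((self.messages[label] + 1) / (sum(self.messages.values()) + 2))
        for word in TOKEN.findall(text.lower()):
            # The extra +1 in the denominator reserves mass for unseen words.
            score += math.log((counts[word] + 1) / (total + vocab + 1))
        return score

    def is_bounce(self, text):
        return self.log_score(text, "bounce") > self.log_score(text, "other")


if __name__ == "__main__":
    f = BounceFilter()
    f.train("delivery to T_MAILADDR failed T_MAILADDR T_MAILADDR undeliverable", "bounce")
    f.train("lunch tomorrow at noon see you there", "other")
    print(f.is_bounce("message to T_MAILADDR undeliverable"))
```

Working in log space avoids underflow on long messages, and the smoothing keeps a single unseen word from zeroing out a whole class.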

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • I check the Return-Path header for <>. That catches most of the bounces.

  • To know which bounces I want to keep and which I want to throw away, a while ago I thought it would be a nice experiment to see if spammers ever use the envelope from that I use. Originally, if I did indeed get spam on that address, I planned to encode some information in the envelope from to find out how they got the address.

    But I have still not received a single message on that address that wasn't a bounce that I wanted to read.

    The mail address that I use in the headers is juerd@example.com (but with ano
    • If you send out all your mail with juerd@c4.example.com as the MAIL FROM, then that address will end up in some Received lines in some messages. Some of those messages will end up on computers that get infected with worms. Some of them will end up somewhere on the web, where the address will be harvested by spammers. So soon you'll be getting bounces to that address in response to spam and worms. I guess that means you'll have to change the address periodically.

      Of course there are stupid mail systems o
      • That's what I thought, but I've been using this for over a year now and I haven't received a single unwanted message at juerd@c4.example.com yet.
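The Return-Path check suggested in the first comment is easy to try out. The sketch below uses Python's standard `email` package; the function name `has_null_return_path` is made up for this example, and it assumes the delivering mail server has already recorded the envelope sender in a Return-Path header.

```python
from email import message_from_string


def has_null_return_path(raw_message: str) -> bool:
    """Return True if the message carries the null sender <> as Return-Path."""
    msg = message_from_string(raw_message)
    # A missing header is treated as "not a bounce" here (assumption).
    return msg.get("Return-Path", "").strip() == "<>"


if __name__ == "__main__":
    raw = "Return-Path: <>\nSubject: Delivery failure\n\nYour mail could not be delivered.\n"
    print(has_null_return_path(raw))
```

As the comment says, this catches most bounces but not all of them, so it would complement a content-based filter rather than replace it.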