
Matts (1087)


I work for MessageLabs [messagelabs.com] in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.

Journal of Matts (1087)

Tuesday April 30, 2002
01:04 PM

Mail parsing

[ #4558 ]

People may be wondering why I've not been doing much in the world of XML lately. Well, it's because I've been busy with my anti-spam stuff. At work I'm developing their next generation anti-spam solution (at the moment it's just based on enabling one or more realtime block lists). Most of it is based on SpamAssassin, but I'm working on some new stuff using Bayesian probability, which is pretty interesting. At the moment it's getting about 85-90% effectiveness (that's just the Bayes stuff - I'll be combining that with the SpamAssassin rules to get an even higher catch rate), but I think I can get it a bit higher than that with some tuning. Plus someone I know from #axkit is trying to talk me into using Bayesian neural nets, but I'll have to see about that - it's already at the point where my brain is cracking under the strain!

Perhaps the largest part of this work has been in doing improved email parsing. We have a fantastic email parser at work, but I wanted to do it in Perl. I already had some old code lying around, so I basically improved on that.

So why didn't I use some other CPAN module? Well, several reasons:

1. They all seem to use RAM to parse emails. Well we receive attachments in the multi-megabyte size, and the email parsing modules start to suck up gobs of RAM when they encounter these (has this changed since last time I checked?). The one I wrote uses temp files for everything (including the email body).

2. They don't make any effort to decode the content from the given encoding to UTF-8. Mine decodes everything to UTF-8. Maybe this has also changed since last time I checked.

3. Mine attempts to act like email clients in the way it decodes stuff.

4. I wanted to do it myself as an exercise. Your own code is always easier to hack on than someone else's.
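
Points 1 and 2 can be illustrated with a minimal sketch. This is not the parser described in this post: the helper name spool_body_as_utf8, the ISO-8859-1 fallback, and the '.body' suffix are all assumptions made for the example. It copies a message body from a filehandle into a temp file one line at a time, decoding from the declared charset to UTF-8 on the way, so the whole body is never held in RAM. It ignores Content-Transfer-Encoding (base64, quoted-printable), and line-by-line decoding is only safe for ASCII-compatible charsets.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use File::Temp ();

# Hypothetical helper (an assumption for illustration, not the author's code):
# copy a message body from $in, a filehandle positioned just after the
# headers, into a temp file as UTF-8 without slurping it into memory.
sub spool_body_as_utf8 {
    my ($in, $charset) = @_;

    # Fall back to ISO-8859-1 when the declared charset is missing or unknown.
    # This fallback is an assumption; every byte is valid in that encoding.
    my $enc = find_encoding($charset || 'ISO-8859-1')
           || find_encoding('ISO-8859-1');

    # The temp file is removed when $tmp goes out of scope.
    my $tmp = File::Temp->new(SUFFIX => '.body');
    binmode $tmp, ':encoding(UTF-8)';

    binmode $in;    # read raw bytes, no input layer
    while (defined(my $line = <$in>)) {
        # Decode one line at a time: only safe for ASCII-compatible charsets.
        print {$tmp} $enc->decode($line);
    }
    $tmp->flush;
    return $tmp;
}
```

A caller would pass the body filehandle plus the charset taken from the part's Content-Type header, then reopen $tmp->filename (or seek back to the start) to read the decoded text.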

Anyway, if anyone wants the code, I'd be willing to consider releasing it under a private namespace. Let me know if there's any interest whatsoever. I need to do some more testing on it - I've got 20,000 emails to run it through from the last couple of days' traffic on one of our servers. If it can parse all of those, I think I'll have pretty good coverage.

  • Parsing Email (Score:3, Informative)

    by ziggy (25) on 2002.04.30 13:43 (#7755) Journal
    They all seem to use RAM to parse emails. Well we receive attachments in the multi-megabyte size, and the email parsing modules start to suck up gobs of RAM when they encounter these (has this changed since last time I checked?). The one I wrote uses temp files for everything (including the email body).
    I've written a mailbox iterator a couple of times. Every time I finish, I ask myself if it's something worth releasing, and more often than not, the answer I come up with is "no".

    The technique I use is to get an open filehandle for a mailbox (good for "zcat mbox.gz |"), and then load up one message at a time, stopping at the next message or the end of file. All that really boils down to is treating a line that matches /^From .*\d{4}$/ as start-of-[next-]message. That always feels so trivial.
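
The iterator described above might look roughly like the following sketch. The name mbox_iterator and the closure-based interface are assumptions made for illustration, not the commenter's actual code. It reads from an already-open filehandle and treats any line matching /^From .*\d{4}$/ as the start of the next message. A body line that happens to match would be mis-split, so this relies on the mbox writer having escaped such lines (commonly as ">From ").

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): returns a closure
# that yields one raw message per call from an open mbox filehandle, and
# undef once the mailbox is exhausted.
sub mbox_iterator {
    my ($fh) = @_;
    my $pending;    # separator line already read that starts the next message

    return sub {
        my $msg = $pending;
        undef $pending;

        while (defined(my $line = <$fh>)) {
            if ($line =~ /^From .*\d{4}$/) {
                if (defined $msg) {
                    # Start of the next message: remember it, return this one.
                    $pending = $line;
                    return $msg;
                }
                $msg = $line;    # start of the very first message
                next;
            }
            # Lines before the first separator are skipped.
            $msg .= $line if defined $msg;
        }
        return $msg;    # last message, or undef at end of file
    };
}
```

Only one message is held in memory at a time. The filehandle can come from a plain open or from a pipe such as open(my $fh, '-|', 'zcat', 'mbox.gz'), and each returned scalar can then be handed to whatever parser is in use.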

    Once that's done, then it's a simple issue of shoving that scalar at Graham's mail parser and calling it a day. But the kind of stuff I do with email doesn't get into attachments or charsets (yet).

    What I'd like to see is a mail parsing library in C. The few times I've started one of these projects, I'm amazed at how fast Mutt plows through a mailbox, and how long it takes Mail::* to do the same thing.

    • Actually I did see one on freshmeat just the other day... Ah yes, there it is [freshmeat.net]. Looks quite a bit like our parser at work (only probably doesn't support as many freaky fringe conditions as ours does, but most people don't need that).

      I should also do some timing on mine to see how fast it is. I imagine mutt is fast simply because it punts scanning the email until "later", so it would be really tricky to compare its speed to something aimed at parsing a single email.
  • Bayesian neural nets (Score:3, Interesting)

    by jdavidb (1361) on 2002.04.30 16:47 (#7764) Homepage Journal

    I know what Bayesian networks are, and I know what neural networks are, but I don't know what Bayesian neural networks are.

    That said, I just took off all day Monday so I could spend all night Sunday writing a pure-Perl implementation of a multilayer feedforward neural network with backpropagation training algorithm, using PDL. This is yours for the asking, if it's useful and you want it. (I'm speculating Bayesian neural networks are going to be so different that nothing here would be useful.)

    --
    J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
    • Could potentially be very useful, if it's easy to use. matt@sergeant.org if you want to send it along. I would of course be using it in a non-free software project.

      I think Bayesian neural networks basically use Bayes' theory of probability to determine the network, rather than simpler training algorithms. But I'm guessing - I haven't read up on it yet.

      If anyone's interested, I'm now getting about 95% accuracy on spam detection, and about 90% accuracy on non-spam detection (the system tells me if it think