People may be wondering why I've not been doing much in the world of XML lately. Well, it's because I've been busy with my anti-spam stuff. At work I'm developing their next-generation anti-spam solution (at the moment it's just based on enabling one or more realtime block lists). Most of it is based on SpamAssassin, but I'm working on some new stuff using Bayesian probability, which is pretty interesting. At the moment it's about 85-90% effective (that's just the Bayes stuff; I'll be combining that with the SpamAssassin rules to get an even higher catch rate), and I think I can get it a bit higher than that with some tuning. Plus someone I know from #axkit is trying to talk me into using Bayesian neural nets, but I'll have to see about that. It's already at the point where my brain is cracking under the strain!
Perhaps the largest part of this work has been improving the email parsing. We have a fantastic email parser at work, but I wanted to do it in Perl. I already had some old code lying around, so I basically improved on that.
So why didn't I use some other CPAN module? Well, several reasons:
1. They all seem to parse emails in RAM. Well, we receive attachments in the multi-megabyte range, and the email parsing modules start to suck up gobs of RAM when they encounter these (has this changed since last time I checked?). The one I wrote uses temp files for everything (including the email body).
2. They don't make any effort to decode the content from the given encoding to UTF-8. Mine decodes everything to UTF-8. Maybe this has also changed since last time I checked.
3. Mine attempts to act like email clients do in the way it decodes stuff.
4. I wanted to do it myself as an exercise. Your own code is always easier to hack on than someone else's.
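To give a rough idea of points 1 and 2, here's a minimal sketch. It's in Python rather than Perl and it isn't the actual code: `spool_parts` is a made-up name, and Python's standard `email` package still parses the whole message in memory, so this only illustrates the two steps of writing each decoded part out to a temp file and transcoding text parts to UTF-8.

```python
import email
import tempfile
from email import policy


def spool_parts(raw_bytes):
    """Return a list of (content_type, temp_file) pairs, one per leaf MIME part.

    Hypothetical helper for illustration only.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    spooled = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body of their own
        # decode=True undoes base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or "us-ascii"
            try:
                text = payload.decode(charset, errors="replace")
            except LookupError:
                # Unknown charset label: fall back to a lossless byte mapping
                text = payload.decode("latin-1")
            payload = text.encode("utf-8")
        tmp = tempfile.TemporaryFile()  # removed automatically when closed
        tmp.write(payload)
        tmp.seek(0)
        spooled.append((part.get_content_type(), tmp))
    return spooled
```

Each returned file handle is positioned at the start, so a caller can read the UTF-8 text (or the raw attachment bytes) back in chunks instead of holding it all in a string.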
Anyway, if anyone wants the code, I'd be willing to consider releasing it under a private namespace. Let me know if there's any interest whatsoever. I need to do some more testing on it: I've got 20,000 emails to run it through from the last couple of days' traffic on one of our servers. If it can parse all of those, I think I'll have pretty good coverage.