  • I also posted very similar code [sourceforge.net] to do this to the SpamAssassin-Talk mailing list yesterday, in case anyone is interested in a slightly different encoding of the algorithm.

    I use my own mail parser class, which doesn't hold the message in memory (it uses temp files instead) and decodes all the MIME stuff for you. It might be worth checking out too.

    We'll probably plug this into SA 2.41+ or SA3 (whichever comes first).
    • If you've got SpamAssassin covered, I'll keep going on a Mail::Audit plugin (which also handles MIME). I've reworked the algorithm to scan the weighting file after tokenizing the message body. No more sucking everything into memory.
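
A rough sketch of that "tokenize first, then scan the weighting file" order, in case it helps anyone follow along. This is not the actual plugin code: the helper names `tokenize` and `weights_for`, the token regex, and the one-entry-per-line `token<TAB>weight` file format are all assumptions made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the distinct tokens of a message body.
# The token pattern is an assumption, not the plugin's real tokenizer.
sub tokenize {
    my ($body) = @_;
    my %seen;
    $seen{$_} = 1 for $body =~ /([A-Za-z0-9'\$-]+)/g;
    return \%seen;
}

# Read the weighting file one line at a time and keep only the weights
# for tokens that occur in this message, so the full table is never
# held in memory.
# Assumed (hypothetical) file format: "token<TAB>weight" per line.
sub weights_for {
    my ($fh, $tokens) = @_;
    my %weight;
    while (my $line = <$fh>) {
        chomp $line;
        my ($token, $w) = split /\t/, $line, 2;
        next unless defined $w && $tokens->{$token};
        $weight{$token} = $w;
    }
    return \%weight;
}
```

Memory use then grows with the number of distinct tokens in one message rather than with the size of the whole weighting file, at the cost of one sequential pass over the file per message.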

      By the way, a simple tokenizer tweak cut my false negatives in half. I only force a token to lowercase if at least one character is already lowercase. This has the effect of keeping separate (high) weights for "MILLION" and "EMAILS", distinct from those for "million" and "emails", which have l