geoff (2013)

  reversethis-{gro ... om} {ta} {ffoeg}

Journal of geoff (2013)

Thursday June 10, 2004
08:07 AM

fighting foreign spam

[ #19179 ]
I've been getting an increasing amount of foreign-language spam lately - foreign as in German, not the typical Asian variants. ordinarily I don't notice when my spam volume increases, but these messages have been getting past my filter and ending up in my inbox, I suspect because I haven't taught my filters to pick up on the non-English tokens.

do people with .de addresses get more native language spam?

this got me thinking - why do I have to train my filters at all? I mean, certainly there is enough spam (English and non-English) floating around the world that a decent corpus could be assembled, regularly added to, and made available. this would have a number of advantages, like a larger corpus for more accurate results. it would also enable SA users to react to new spam forms more quickly than they could on their own - users contribute spam consistently and download a new database regularly and *poof* you have one killer spam-fighting machine.
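
to make the pooled-corpus idea concrete, here is a minimal Python sketch of merging per-user token counts into one shared table. everything in it is an assumption for illustration: `TokenDB`, `merge` and `spam_probability` are hypothetical names, the structure is not SpamAssassin's on-disk format, and the probability is a plain relative-frequency estimate rather than whatever SA's Bayes code actually computes.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TokenDB:
    """Per-user token counts (hypothetical structure, not SA's real database)."""
    nspam: int = 0                                  # spam messages learned
    nham: int = 0                                   # ham messages learned
    spam: Counter = field(default_factory=Counter)  # token -> spam messages containing it
    ham: Counter = field(default_factory=Counter)   # token -> ham messages containing it

    def learn(self, tokens, is_spam):
        """Record one message; each distinct token counts once per message."""
        if is_spam:
            self.nspam += 1
            self.spam.update(set(tokens))
        else:
            self.nham += 1
            self.ham.update(set(tokens))


def merge(*dbs):
    """Sum message totals and token counts from several databases."""
    out = TokenDB()
    for db in dbs:
        out.nspam += db.nspam
        out.nham += db.nham
        out.spam.update(db.spam)  # Counter.update adds counts together
        out.ham.update(db.ham)
    return out


def spam_probability(db, token):
    """Relative-frequency estimate; None if the token was never seen."""
    s = db.spam[token] / db.nspam if db.nspam else 0.0
    h = db.ham[token] / db.nham if db.nham else 0.0
    if s + h == 0:
        return None
    return s / (s + h)


# my own training data knows nothing about German tokens
mine = TokenDB()
mine.learn(["cheap", "pills"], is_spam=True)
mine.learn(["perl", "patch"], is_spam=False)

# a German-speaking friend's training data does
friend = TokenDB()
friend.learn(["billig", "angebot"], is_spam=True)
friend.learn(["hallo", "perl"], is_spam=False)

shared = merge(mine, friend)
```

with these toy inputs, `spam_probability(mine, "billig")` is `None` because I have never seen the token, while the merged table has an opinion about it. a real shared database would also have to deal with untrusted contributions, which this sketch ignores.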
  • Sounds like you're thinking of Vipul's Razor.
    • yeah, I knew it was a good idea :) but really what I want is not another plugin I need to administer.

      what I originally thought was that it would be cool to take a German friend's sa-learn results and add them to my own, but I'm not sure whether there is a routine for merging databases like that. making it a globally shared database was just the logical extension.
      • --dump and --import

        sa-learn can --dump its database and --import other databases. --import suggests that it is for old formats, but I'd guess (with no evidence) that it works with current formats. It also says it clobbers the current DB_FILE, but I'd guess (with no evidence) that it could be rewritten to allow merging.

        Then you just need CBAN.
  • The problem with a universal corpus is that one person's spam is another person's ham.

    Just to pick a random example, almost any message received by you or me that is written in German is going to be spam. My command of the language doesn't go much deeper than "wo ist der Bierhaus, fraulein?", so anyone sending me an email in German is probably trying to sell me something. So for me, German text is at least a 98% confident indicator of spam.

    On the other hand, there are hundreds of millions of German spea



    • I think that you can set things up so that your own private ruleset is used to actively delete spam, while the public ruleset would be used to flag spam while still passing it on. Then, you can scan the flagged items fairly quickly and use them to update your private ruleset.
  • You can block most of this in SA by using the locale settings, and adding something like "score CHARSET_FARAWAY 5.0" to your user_prefs.
    • Actually what you want is the ok_languages setting. Just set this to be "en" (or whatever languages you can read) and be done with it.
  • I (in Germany) have been getting a lot of Italian spam lately that was getting past SA. Usually almost all of the spam I get is English, though, with maybe one or two German messages per day.

    When we first got sa installed at my company, the sysad decided all English mail must be spam and gave it a 4 point spam bonus. I got him to reverse that decision, though :)
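
The two user_prefs suggestions from the comments above could be combined roughly as follows. This is a sketch rather than a tested configuration: the file path is the usual per-user default, "en" is just an example value, and depending on the SpamAssassin version, ok_languages may require the TextCat language-guessing plugin to be loaded.

```
# ~/.spamassassin/user_prefs

# First suggestion: declare which locales' character sets are expected,
# and raise the score of the rule that fires on other character sets.
ok_locales en
score CHARSET_FARAWAY 5.0

# Second suggestion: list the languages you can actually read.
ok_languages en
```

Note the point raised earlier in the thread: these settings suit someone who never gets legitimate mail in other languages, and would be a poor fit for a German speaker.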