Thursday June 10, 2004
08:07 AM
fighting foreign spam
I've been getting an increasing amount of foreign-language spam lately - foreign as in German, not the typical Asian variants. ordinarily I wouldn't notice my spam volume increasing, but these have been getting past my filter and ending up in my inbox. I suspect that's because I haven't taught my filters to pick up on the non-English tokens.
do people with .de addresses get more native language spam?
this got me thinking - why do I have to train my filters at all? I mean, certainly there is enough spam (English and non-English) floating around the world that a decent corpus could be assembled, regularly added to, and made available. this would have a number of advantages, like a larger corpus for more accurate results. it would also let SpamAssassin (SA) users react to new forms of spam more quickly than they could on their own - users contribute spam consistently and download a new database regularly and *poof* you have one killer spam-fighting machine.
razor (Score:1)
http://razor.sourceforge.net/
rjbs
Re:razor (Score:2)
what I originally thought about was it would be cool to take a german friend and add the results from his sa-learn sessions to my own, but I'm not sure if there is a routine for merging databases like that or not. making it a globally shared database was just the logical extension.
Re:razor (Score:1)
sa-learn can --dump its database and --import other databases. --import suggests that it is for old formats, but I'd guess (with no evidence) that it works with current formats. It also says it clobbers the current DB_FILE, but I'd guess (with no evidence) that it could be rewritten to allow merging.
Then you just need CBAN.
rjbs
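For what it's worth, the merging idea in this thread can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `TokenDB`, `learn`, and `merge` are hypothetical names, and the in-memory structure is not SpamAssassin's database format or anything sa-learn actually reads or writes. It only shows that if two databases store per-token spam/ham counts plus message totals, merging amounts to adding the counts.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TokenDB:
    """Hypothetical token database; NOT SpamAssassin's on-disk format."""
    nspam: int = 0  # number of spam messages learned
    nham: int = 0   # number of ham messages learned
    spam: Counter = field(default_factory=Counter)  # token -> spam messages containing it
    ham: Counter = field(default_factory=Counter)   # token -> ham messages containing it

    def learn(self, tokens, is_spam):
        """Record one message; each distinct token counts once per message."""
        unique = set(tokens)
        if is_spam:
            self.nspam += 1
            self.spam.update(unique)
        else:
            self.nham += 1
            self.ham.update(unique)


def merge(a, b):
    """Return a new database holding the summed counts of a and b."""
    return TokenDB(
        nspam=a.nspam + b.nspam,
        nham=a.nham + b.nham,
        spam=a.spam + b.spam,
        ham=a.ham + b.ham,
    )


# Made-up training data for two users.
mine = TokenDB()
mine.learn(["cheap", "pills"], is_spam=True)
mine.learn(["perl", "patch"], is_spam=False)

friend = TokenDB()
friend.learn(["billig", "kaufen"], is_spam=True)
friend.learn(["hallo", "perl"], is_spam=False)

merged = merge(mine, friend)
print(merged.nspam, merged.nham, merged.ham["perl"])
```

Summing counts treats both users' judgments as equally valid for everyone, which is exactly the assumption a shared database would have to make.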
corpuses (corpi?) don't work that way (Score:2)
The problem with a universal corpus is that one person's spam is another person's ham.
Just to pick a random example, almost any message received by you or me that is written in German is going to be spam. My command of the language doesn't go much deeper than "wo ist der Bierhaus, fraulein?", so anyone sending me an email in German is probably trying to sell me something. So for me, German text is at least a 98% confident indicator of spam.
On the other hand, there are hundreds of millions of German spea
--
DO NOT LEAVE IT IS NOT REAL.
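To make the "one person's spam is another person's ham" point concrete, here is a rough Python sketch of a naive per-token spam estimate. The function name `spam_probability` and all of the counts are made up for illustration, and the formula is a simplified relative-frequency estimate, not the scoring SpamAssassin actually uses. It assumes both users have learned at least one spam and one ham message.

```python
def spam_probability(spam_hits, nspam, ham_hits, nham):
    """Naive per-token estimate: how often the token shows up in spam
    relative to how often it shows up in spam plus ham.

    Assumes nspam > 0 and nham > 0.
    """
    spam_rate = spam_hits / nspam
    ham_rate = ham_hits / nham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical counts for one German-language token, per 100 spam and
# 100 ham messages learned by each user.
english_reader = spam_probability(spam_hits=49, nspam=100, ham_hits=1, nham=100)
german_reader = spam_probability(spam_hits=40, nspam=100, ham_hits=60, nham=100)

print(round(english_reader, 2), round(german_reader, 2))
```

With these invented numbers the token is a strong spam signal for the first user and a weak one for the second, so pooling the two sets of counts would blur a distinction that each user's own database captures.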
Re:corpuses (corpi?) don't work that way (Score:2)
locale stuff (Score:1)
Re:locale stuff (Score:2)
foreign spam (Score:2)
I (in Germany) have been getting a lot of Italian spam lately that was getting past SA. Usually almost all of the spam I get is English, though, with maybe one or two German messages per day.
When we first got SA installed at my company, the sysadmin decided all English mail must be spam and gave it a 4 point spam bonus. I got him to reverse that decision, though :)