Matts (email not shown publicly) I work for MessageLabs [messagelabs.com] in Toronto, ON, Canada. I write spam filters, MTA software, high performance network software, string matching algorithms, and other cool stuff mostly in Perl and C.
Asian Usenet groups would, I guess, be a reasonable approximation - presumably if you picked a cross-section of hierarchies (arts, comp, etc.). The only danger is that they contain spam - I don't know what the ratio of spam to non-spam would be.
That's my biggest problem in this - I have absolutely no idea if something in Korean is spam or not... The only differentiators I have are the "other" clues, such as HTML forms, JavaScript, etc.
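To make the "other" clues concrete, here is a minimal sketch of counting structural markers in a message body without reading the text at all. The tag set and the `structural_clues` name are my own illustrative assumptions, not anything from an actual filter, and the counts are only clues, not a verdict.

```python
from html.parser import HTMLParser

# Assumption: which tags count as suspicious is illustrative only,
# not a tested rule set from any real filter.
SUSPICIOUS_TAGS = {"form", "script", "iframe"}


class ClueCounter(HTMLParser):
    """Counts language-independent markup clues in an HTML body."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def _bump(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag in SUSPICIOUS_TAGS:
            self._bump(tag)
        for name, value in attrs:
            is_handler = name.startswith("on")  # onclick, onload, ...
            is_js_url = bool(value) and value.strip().lower().startswith("javascript:")
            if is_handler or is_js_url:
                self._bump("inline_js")


def structural_clues(body):
    """Return a dict of clue name -> count for one message body."""
    parser = ClueCounter()
    parser.feed(body)
    parser.close()
    return parser.counts
```

Plain text in any language yields an empty dict, so this says nothing about a Korean message that carries no markup.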
Well, I don't have any specific sources, but I can suggest a strategy for obtaining them.
Consider that if you wanted a reliable source of non-spam English email, one method might be to find a selection of mailing lists which are configured to allow only subscribers to post. Subscribe some address to each list, and start listening in. Most spammers do not bother subscribing to a gazillion mailing lists in order to post a single spam message (before their posting address gets unsubscribed for their p
I remember DejaNews used to do a respectable job of eliminating spam from their web archive of Usenet. If Google is doing at least as good a job, you could see how comfortable you are with their spam-culling capabilities in English. If the ratio looks good, then you could try harvesting from their non-English groups.
Unfortunately, this assumes that they do as good a job in each language. That may be a horribly flawed assumption, and should certainly be checked. It may be that (like
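One rough way to check the ratio: hand-label a random sample of archived posts and put an interval around the observed spam fraction. The sketch below uses the standard Wilson score interval; the function name and the sample numbers in the usage note are hypothetical, and the result is only as good as the sampling and the labelling.

```python
import math


def wilson_interval(spam_count, sample_size, z=1.96):
    """Wilson score interval for the spam fraction in a labelled sample.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= spam_count <= sample_size:
        raise ValueError("spam_count must be between 0 and sample_size")

    p = spam_count / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    centre = (p + z2 / (2 * sample_size)) / denom
    half = z * math.sqrt(p * (1 - p) / sample_size + z2 / (4 * sample_size ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, `wilson_interval(5, 200)` gives the plausible range for the spam fraction if 5 of 200 hand-checked posts (made-up numbers) turned out to be spam. The same check would have to be repeated per language, by someone who can read it.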
Asian emails (Score:2)
Re:Asian emails (Score:2)
asian non-spam email (Score:2)
Re:asian non-spam email (Score:2)
Re:asian non-spam email (Score:2)
That is such a bad idea.
If the list allows everyone to subscribe, it's probably because that hasn't caused them a problem so far. You'll be the problem.
If the list doesn't allow everyone to post, the moderator will probably get your mail. He gets plenty of spam without you adding to it.
- ask
-- ask bjoern hansen [askbjoernhansen.com], !try; do();