All the Perl that's Practical to Extract and Report

Matts (1087)

I work for MessageLabs in Toronto, ON, Canada. I write spam filters, MTA software, high-performance network software, string matching algorithms, and other cool stuff, mostly in Perl and C.

Journal of Matts (1087)

Monday September 30, 2002
07:02 AM

Need Asian emails

[ #8091 ]

Anyone know where I can get a good source of Asian hams (non-spams)? I need Japanese, Chinese, Taiwanese, and Korean emails to feed to a classifier.

  • Asian Usenet groups would be a reasonable approximation, I guess, presumably if you picked a cross-section (arts, comp, etc.). The only danger is that they contain spam; I don't know what the ratio of spam to non-spam would be.
    • That's my biggest problem in this: I have absolutely no idea if something in Korean is spam or not... The only differentiators I have are the "other" clues, such as HTML forms, JavaScript, etc.
  • Well, I don't have any specific sources, but I can suggest a strategy for obtaining them.

    Consider that if you wanted a reliable source of non-spam English email, one method might be to find a selection of email lists that are configured to allow only subscribers to post. Subscribe some address to each list, and start listening in. Most spammers do not bother subscribing to a gazillion email lists in order to post a single spam message (before their posting address gets unsubscribed for their p
    • another source...

      I remember DejaNews used to do a respectable job of eliminating spam from their web archive of Usenet. If Google is doing at least as good a job, you could see how comfortable you are with their spam-culling capabilities in English. If the ratio looks good, then you could try harvesting from their non-English groups.

      Unfortunately, this assumes that they do as good a job in each language. That may be a horribly flawed assumption, and should certainly be checked. It may be that (like
    • > 1) find lists with web archives. Post to them.

      That is such a bad idea.

      If the list allows everyone to subscribe, it's probably because that hasn't caused them a problem so far. You'll be a problem.

      If the list doesn't allow everyone to post, the moderator will probably get your mail. He gets plenty of spam without you adding to it.

        - ask

      -- ask bjoern hansen, !try; do();
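
The "other" clues mentioned in the thread (HTML forms, JavaScript) can be checked without reading the language of the message at all. Below is a minimal Python sketch of that idea, not the classifier discussed in the post: the function name `structural_clues`, the two regular expressions, and the choice of clues are all assumptions made for illustration, and a hit is at best a weak hint that a message deserves a closer look.

```python
import email
import re

# Hypothetical, language-independent patterns: they look only at markup,
# never at the (possibly Korean, Japanese, or Chinese) text itself.
FORM_RE = re.compile(r"<\s*form\b", re.IGNORECASE)
SCRIPT_RE = re.compile(r"<\s*script\b|javascript:", re.IGNORECASE)


def structural_clues(raw_message: bytes) -> dict:
    """Report whether any text/html part contains a form or JavaScript."""
    msg = email.message_from_bytes(raw_message)
    clues = {"html_form": False, "javascript": False}
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True) or b""
        # Use the declared charset so multi-byte encodings are not
        # misread as markup; fall back to latin-1, which never fails.
        charset = part.get_content_charset() or "latin-1"
        try:
            html = payload.decode(charset, errors="replace")
        except LookupError:
            html = payload.decode("latin-1")
        if FORM_RE.search(html):
            clues["html_form"] = True
        if SCRIPT_RE.search(html):
            clues["javascript"] = True
    return clues
```

Messages harvested from Usenet or mailing lists could be passed through a check like this to flag candidates for manual review before they are fed to a classifier as ham.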