
Journal of babbage (2619)

Sunday January 23, 2005
07:17 PM

Anti-blog spam efforts

So, anecdotally, it looks like Google's anti-blog-spam campaign may be working. A handful of easy changes to my home blog seems to have helped tremendously:

  • I looked over Google's plan, and Movable Type's recommendations.
  • I added the Movable Type implementation of the "nofollow" plugin
  • I renamed all the MT CGI scripts so that spammers have to actually look to find the comment URL.
  • I added a new script at the old comment & trackback URL:

    #!/usr/bin/perl -wT
    print "Content-type: text/plain\n\n";

  • After noticing that the spammers all seem to have a user agent of "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)", I added the following code to the comment script:

    sub squash_spammers {
        # Default to empty strings so the matches below never see undef.
        my $agent   = $ENV{'HTTP_USER_AGENT'} ||= "";
        my $referer = $ENV{'HTTP_REFERER'}    ||= "";
        if ( ( $agent =~ m/NET CLR 1\.1\.4322/ ) ||
             ( $referer =~ m@\.info/$@ ) ) {
          # print "Content-type: text/plain\n\nsorry\n";
            die "Sorry, this is a spam-free zone. $!";
        }
    }

    This is now called in the eval block that does the rest of the work for the comment script, so attempts to spam me automatically fail. If I need to add more criteria, I can hook them in as needed, but these two rules seem to have caught everything so far.
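The "hook in more criteria as needed" idea could be sketched as a small table of rules. This is only an illustration built on the same two checks as above; the names @rules and looks_like_spam are made up for the example and are not part of Movable Type.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical rule table: each entry is a label plus a test that receives
# the user agent and referer strings. New criteria get appended here.
my @rules = (
    [ 'suspicious .NET CLR agent', sub { $_[0] =~ m/NET CLR 1\.1\.4322/ } ],
    [ 'referer ends in .info/',    sub { $_[1] =~ m@\.info/$@ } ],
);

# Returns the label of the first matching rule, or undef if nothing matches.
sub looks_like_spam {
    my ($agent, $referer) = @_;
    $agent   = '' unless defined $agent;
    $referer = '' unless defined $referer;
    for my $rule (@rules) {
        my ($label, $test) = @$rule;
        return $label if $test->($agent, $referer);
    }
    return;
}

# Inside the comment script's eval block, the call might look like this:
#   if ( my $why = looks_like_spam( $ENV{HTTP_USER_AGENT}, $ENV{HTTP_REFERER} ) ) {
#       die "Sorry, this is a spam-free zone. ($why)\n";
#   }
```

Keeping the label with each test means the die message (or a log line) can say which rule fired, which helps when deciding whether a rule is still earning its keep.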

Since making these changes, things have gotten much better. I've had no comment spam this week (usually, a handful makes it past the comment spam plugin), and more strikingly, the amount of referer traffic -- requests for random URLs with referer fields like "" -- has almost, if not quite entirely, disappeared. This is wonderful.

We'll see how well it's working a month from now though ...

Saturday December 13, 2003
11:34 AM

Freeware video rotation options?

Dear Aunty use.perl;

Like many people, my wife and I have digital cameras that can record short mpeg video files in addition to traditional jpeg stills. Like any still camera, taking photos with the camera held vertically is a perfectly conventional thing to do, if the subject matter being photographed would be better framed that way. Caught up in the moment though, we've also got some video files that were shot this way, and these are proving to be much harder to correct. Does anyone know of a good, relatively painless way to rotate video files so that they're right side up? As video-capable digital cameras become more common, this is a feature that I'd assume an increasing number of people will want.

I'd prefer some kind of freeware approach to this, but so far haven't found anything that seems like it will help. It seems like ImageMagick might be the most promising tool, if I can get the mpeg2vidcodec_v12 plugin working on my Mac (lots of make test errors so far...), but even then will it be as simple as a convert -rotate 90 > ? So far, I can't even get to that point with the IM toolkit. CinePaint (nee FilmGimp) didn't seem to want to open *.MOV files to begin with, which confused me as I thought that was the whole point of CinePaint. I've also looked into mjpegtools, mpgtx, VirtualDub and TMPGenc, but none of them seems able to rotate the contents of video files. I was able to open a sideways video file as a series of hundreds of separate still images in Adobe ImageReady, but even with that program's automation tools (and my admittedly shaky grasp of how to use them), rotating them all & stitching it back into one file seems like it'll be annoying. I've also tried Apple's iMovie, but it seems to be geared towards stitching together a collection of video clips rather than manipulating the contents of any given clip in any significant way. I don't have any other commercial software available, and am not that interested in shelling out possibly hundreds for the kind of "pro" software that might work but would be overkill for my usual needs.

As an added bonus, it would be nice to be able to convert individual frames to JPEGs for making thumbnails, or ranges of frames into low-resolution GIF/MNG animations. I have a hunch that the ability to do that may fall out of any solution to the bigger problem, so I'm putting off worrying about this for now, but would like to be able to do it eventually.

Does anyone know of a good way to rotate video files? I realize that the proportions of the converted file will be "wrong", but I don't care -- they're low resolution files meant only for viewing on my computer or maybe a web page, and if I ever want to put the files on a television screen then I can just put up with the vertical letterboxing. So far, the only approach that seems to have any traction at all is to find a way to treat the file as individual frames, rotate one by one, then stitch it back together -- but that seems annoying, particularly if the file also has an audio component that has to be kept track of. Still, for lack of tools to do it any other way, that's the best approach I've been able to come up with. Can anyone suggest something better?

Wednesday October 15, 2003
07:30 PM

"software security device"?

Dear aunty,

Does anyone here understand how Mozilla / Firebird's current security module system works? In particular, does anyone know what's up with the "software security device"?

My fiancee's computer -- a WinXP laptop with no user account passwords (it's just two of us using it, and we trust each other) -- keeps throwing these annoying dialog windows demanding that you "Please enter the master password for the Software Security Device." whenever you take Firebird to a web page with a username & password.

The catch though is that no password I can think of as a likely candidate works. A bit of Googling points to a couple of semi-promising solutions, and while all the ones I've found so far talk about Linux, the general description of the issue seems to be spot on. The workaround -- enter the Linux login account -- doesn't seem to apply here: there is no Windows system login for this account, and leaving the password field blank doesn't work either.

Following on from the Mandrake advice, I tried opening up Firebird's dialog window for the security device settings (go to Tools -> Options, then Advanced -> Certificates -> Manage Security Devices [there's a disclaimer that this is subject to move around in future releases]). This brings up a cryptic dialog window with the "Device Manager" (yay! trusted computing IN OUR TIME), with a hierarchy of cryptically labelled "Security Modules and Devices" on the left (e.g. NSS Internal PKCS #11 Module -> Software Security Device), some cryptic "details" and "values" in the middle panel, and a column of cryptic buttons over on the right. (For a crypto system, they've got being cryptic nailed :-/ ).

With those right-side buttons, three seem to do with managing what appears to be the equivalent of OSX's Keychain ("Login", "Change Password", and "Load"), but again if you click on any of those you get asked for the master password -- the lack thereof being the rabbit I'm chasing down this hole. There's also a button labelled "Enable FIPS", but there seems to be no indication of what happens when you click it or what FIPS stands for (if in fact it's an acronym in the first place).

Hilariously, there's also a "Help" button on the bottom of the dialog, but it doesn't seem to be hooked up to anything. Har har har.



Where did this thing come from, and how can one either fix or disable it? If it's like Keychain, and provides some kind of encrypted safekeeping for sensitive form data, I have no problem with doing it "right" and working logged into the subsystem. As it is now though, it's just getting in the way, and I can't figure out how to reliably get it to go away and stay away.

I say "reliably", because on some sites I get the dialog almost every time I follow a link, while on others it's just at the initial login -- I assume that this has to do with how accounts are being managed on the server, but haven't been able to pin down what's going on there. One annoyance per site I could deal with, but repeating it all the time like this is really getting on my nerves...

Any help wins an ice cream cone -- TIA :-)

Saturday August 09, 2003
05:33 PM

Suggestions for photo gallery sites

I have a loosely organized directory tree of photos on my site. I'd like a nicer way to present these than simple Apache directory listings, but I'd rather reuse existing code (even if possibly modifying code, templates & stylesheets) than start from scratch. I've poked around a bit, but nothing I've seen seems to quite meet the rough criteria I have in mind. For people that have photo galleries on your sites, are there good solutions available from somewhere like CPAN or Sourceforge?

Ideal functional criteria that I'm looking for:

  • Easy, automatic. If I transfer a new directory of image files from my camera to, say, /web/htdocs/photos/2003/08/09/ (or something more descriptive like 09_roadtrip or 09/roadtrip etc), then this should become visible at an associated url with no additional setup work required, complete with thumbnails, titles, maybe an image resize feature, and other page decoration.
  • Listings should allow for visitor comments. If I put up a directory with 100 photos from a relative's wedding, I don't want to be forced to fill in the details for what every image depicts -- but if visitors want to add that that's fine with me.
  • Categorization: There should be an optional way to add metadata, so that for example if I or a visitor indicates that a given picture is of a certain person, place, or event, then visitors should be able to browse or search within that category. Ideally, index level pages should be aware of the categories associated with images in that section, and users should be able to follow links back & forth between (say) a directory hierarchy focused organization of links and a category focused organization.
  • Something able to run mod_perl1 would be ideal. PHP would be tolerable if the package were very nice (I might as well learn PHP sooner or later...), but given the choice between a "C+" mod_perl application and a "B+" PHP one, I'll probably go with mod_perl. Anything written in Python / Ruby / TCL / Scheme / Bourne / Zope / compiled-C would get a silly look but not necessarily be ruled out. Anything in ASP or JSP is straight out.
  • I'm not worried about most internal or external dependencies. That is, having to pull in modules from CPAN is fine with me, as is depending on Imagemagick. If it uses a templating system (Mason, Template Toolkit, or HTML::Template) that's great with me; if layout is done with CSS/XHTML1+, that's even better. I don't mind if the application depends on an external MySQL or PostgreSQL database, or some other storage layer (DBM, XML, text files, etc).
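To make the "easy, automatic" criterion a bit more concrete, here is a rough sketch of the first step any such tool would need: walking the photo tree and grouping image files by directory. It uses only the core File::Find module; the function name scan_gallery and the list of file extensions are assumptions for illustration, and thumbnails, comments, and categories are left out entirely.

```perl
use strict;
use warnings;
use File::Find;

# Walk a photo tree and group image files by the directory that holds them.
# Returns a hash ref: { 'relative/dir' => [ sorted image file names ] }.
# Assumes $root is given without a trailing slash.
sub scan_gallery {
    my ($root) = @_;
    my %albums;
    find(
        {
            no_chdir => 1,    # $_ is then the full path of each entry
            wanted   => sub {
                return unless -f $_ && /\.(?:jpe?g|png|gif)$/i;
                my $dir = $File::Find::dir;
                $dir =~ s{^\Q$root\E/?}{};       # make the path relative to $root
                $dir = '.' if $dir eq '';
                my ($name) = $_ =~ m{([^/]+)$};  # file name without the path
                push @{ $albums{$dir} }, $name;
            },
        },
        $root,
    );
    $_ = [ sort @$_ ] for values %albums;
    return \%albums;
}
```

A page generator could then loop over the returned hash, so that dropping a new directory under the root is enough for it to show up on the next request or the next cron run.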

Most of the ones I've looked at seem okay, but not overwhelmingly so. The comment feature, which really appeals to me, doesn't seem to be available in most of the packages I've looked into. The main exception I can think of is a Movable Type based photoblog, which does well with comments but is cumbersome in other ways -- images have to be added one at a time, and as far as I can tell only the site owner can do anything with categories or other metadata.

The best photo sites I can think of are on a couple of different people's personal sites, but it isn't clear to me if the examples I've seen were custom written or if there's some great, uncredited gallery toolkit out there that these people are taking advantage of.

Here's some of what I've looked into so far:

  • I've tried setting up a photoblog with Movable Type, and this is okay, but it seems like you have to add images one at a time. Additionally, the tutorials I've seen so far seem to revolve around uploading images through the browser, but I have access to put them there via scp or similar. Whether or not that's a real constraint of MT photoblogs, it still seems like the intended workflow is backwards from my point of view: select an image & offer it up for review & comment, rather than put up an array of images & allow people to comment on any they happen to find interesting. While the MT one I've tried seems okay for the occasional exceptionally good image, organizing everything this way is much more work than I have in mind here.
  • I've given Apache::Gallery a try, but from the sample sites linked from, none seem to be much fancier than the file browser in Nautilus or recent versions of Windows Explorer: it's still mainly a list of directories & thumbnails, with some metadata, but no categories, no comments, etc. Mason Gallery, as demonstrated at and distributed at, seems to have the same default qualities & limitations: prettified directories but no comments, categories, etc.
  • Randal Schwartz's site at looks nice (if a bit more "Star Trek: The Next Generation" than I would have gone for, but no matter :), but it doesn't seem to have categories or comment mechanism. Paul Mison's "Stem" site at looks pretty nice, and seems to have the functionality I want: thumbnails, categories, and comments. However, I can't tell how automatic everything is: the thumbnails are tidily cropped squares from rectangular originals, and most (all?) of the images have the title superimposed somewhere in the image. Both of these neat features seem like they'd have to be at least partly manual to me. In any case these two sites are the property of their owners, and I don't know if either of them intends to share the underlying code.

So, of those that have photo sites, what publicly available tools work well?

Thanks for any suggestions :)

Friday July 11, 2003
09:08 AM

Office. It's good for shit.

Is it me, or is this man (mirrored here, but please be merciful of my limited bandwidth) demonstrating that the best place to use office is, well, your "other" office?

Monday June 23, 2003
10:11 PM

Can't launch the .pkg for new iChat

What's with this? The .dmg disc image file for the new iChat beta generates a .pkg installer for the new version, but when I open it the thing just flickers in the dock before going away without having done anything. If I tail the logs in, I see

2003-06-23 23:04:11.366 Installer[7012] Unable to load nib file: PagedInstaller.nib, exiting

I don't get it. There's no .nib files inside the .pkg directory, but I'd think that if there should be it would be part of the distribution, and apparently other people have been able to get the thing installed. Is this somewhere else on my computer then? It's not in either of my personal ~/Library or the system-wide /Library or /System/Library directories. Google has no idea what a "pagedinstaller" is, and neither do I.

Has anyone else seen a bug like this?

Workaround found: Based on a discussion at, I found out that /usr/sbin/installer (see man installer) provides a command line interface:

% sudo installer -pkg ~/Desktop/iChatAVBeta.pkg -target /
installer: Package name is iChat AV Public Beta
installer: Installing onto volume mounted at /.
installer: The install was successful.

Not that that resolves the original question, but now [a] iChat is upgraded, and [b] I have a more "natural" (to me) way to install .pkg applications. So I'm mostly happy now...

09:46 PM


I've just received a copy of the June 2003 edition of ;login:, Usenix & Sage's journal.

My writeup of the MIT Spam Conference is on page 65 :-)

It looks like the permanent URL of the article will be, but that won't work until the next issue after this one comes out.

Thursday June 19, 2003
01:14 AM

*cough* *weezer* *cough*

Anna one --

OOO weee ooh I hack Perl like Buddy Hobbit...
Oh oh and you're Galadriel
I don't care what they say about this stupid ring
I don't care 'bout that

Anna two --

What's with these Ringwraiths, dissing my Shire?
Why do they gotta front?
What did we ever do to these guys
That made them so violent?
My-pre-cious, but you know I'm yours
My-pre-cious, and I know you're mine
My-pre-cious, and that's for all time

Oo-ee-oo I look just like Bilbo Hobbit
Oh-oh, and you're Galadriel
I don't care what they say about this stupid ring
I don't care bout that

Don't you ever fear, I'm always near
I know that you need help
Your tongue is twisted, your eyes are slit
You need Gandalf the Grey
My-pre-cious, but you know I'm yours
My-pre-cious, and I know you're mine
My-pre-cious, and that's for all time

Oo-ee-oo I look just like Bilbo Hobbit
Oh-oh, and you're Galadriel
I don't care what they say about this stupid ring
I don't care bout that
I don't care bout that

Bang, bang a knock on the door
Another big bang and you're down on the floor
Oh no! How do we feel?
Don't look now but I missed my meal
I can't run and I can't kick
What's a matter Bor' are you feeling sick?
what's a matter, what's a matter, what's a matter you?
What's a matter Bor', are you feeling blue? oh-oh!
And that's for all time
And that's for all time

Oo-ee-oo I look just like Bilbo Hobbit
Oh-oh, and you're Galadriel
I don't care what they say about this stupid ring
I don't care bout that
I don't care bout that
I don't care bout that
I don't care bout that

Anna three --

My name is Frodo
I'm carrying the ring
Thanks for all who've read us
But now we're in movies!
Come sit next to me
Pour yourself some tea
Just like Bilbo made
Before he ever found rings
Things were better then
Once but never again
We've all left the den
Let me tell you 'bout it

The Fellowship left right on time
Ringbearing costs only your mind
The Wizard said, 'Hey man, we go all the way'
Of course we were willing to pay

Our name is Smeagol
We gotta pocket full of your fishes
They're fresh out of water
But they're still makin' wishes (makin' wishes)
Tell us what to do
Now that Orcs have come --
Now with Elephants!
And you know what else?
Guess what we received
Out of Mordor today
Words of deep concern
From my little pressioussesss

In Rohan it's not going as we planned
Their warlord has fallen betranced
The great Ents will not clear a path
The wizard swears he learned his math

The Uruk-hai are leaving home
The Uruk-hai are leaving home
The Uruk-hai are leaving home
The Uruk-hai are leaving home

The Uruk-hai are leaving home
The Uruk-hai are leaving home
The Uruk-hai are leaving home
No! No! No!

My name is Frodo

Anna okay I'll stop now :-)

Really though, I think I would have actually liked all the singing bits in LOTR if they'd been done to some kind of crunchy AC-DC/Ramones/Weezer guitars. Nevermind the Boll^H^H^H^HZeppelin, a quick glance at Weezer's lyrics shows lots of places for funny LOTR riffs... :-)

Tuesday May 20, 2003
11:44 PM

Buffy ends

Okay, Buffy is over.

Can we please finish Perl 6 now?

Sunday January 19, 2003
05:04 AM

Spam Conference notes

Update: A condensed version of this writeup appeared in the June 2003 issue of ;login:, the magazine of Usenix & SAGE. I'm a happy camper :-) A PDF of the article has been made available.

I was waiting for the review to show up on Slashdot, as the conference was really good. The audio proceedings have been put online, but I'm not sure if they can take a Slashdotting, so please be gentle :) If you have 8 hours to spare, the whole day was pretty good & worth listening to, but the schedule as planned isn't exactly the sequence people spoke in, so you may have to jump around the RealAudio stream a little bit.

Turning my notes for the day into something vaguely coherent, here are some highlights from the proceedings. There are a couple of speakers that I didn't write anything down for, but from mid-morning on this should be pretty comprehensive. Apologies in advance if my notes lead me to attribute certain comments to the wrong speaker -- if anyone notices any mistakes please feel free to add corrections:

  • Bill Yerazunis - CRM114 & MailFilter

    Because Perl "freaks him out", Yerazunis came up with the CRM114 minilanguage (points for anyone that gets the joke in the name without googling for it :), then wrote MailFilter in CRM114 as an implementation of a filter that can be used with Procmail or SpamAssassin or what have you. The basic idea is to decompose a message into a set of "features" composed of various permutations of single words, consecutive words, words appearing within a certain distance of one another, etc, such that the set of features N is very much bigger than the set of words X. You then analyze the features in various ways and if you get above a certain arbitrary threshold, you flag the message as spam & handle it accordingly.

    He claimed that with this software he could get better than 99.9% accuracy in nailing spam, and a similar percentage in letting through "ham" (the term everyone was using for legit mail; a false positive is ham that was falsely identified as spam). One of Yerazunis' observations is that the best way to defeat the spam problem is to disrupt the economics: if a 99.9% or better filter rate were to become the norm, then the cost of delivering spam can be pushed higher than the cost of traditional mail and the problem will naturally go away without requiring legislation (which would be nice anyway, but we can't count on it).

    The drawback of CRM114/MailFilter is that it can only handle about 20k of text per second, so it's not appropriate for large scale use yet. Still an interesting project to watch though.
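As a rough illustration of the "features" idea from Yerazunis's talk (single words, consecutive words, and words within a certain distance of one another), here is a minimal Perl sketch. It is not CRM114's actual algorithm; the function name and the "a <gap> b" pair format are invented for this example.

```perl
use strict;
use warnings;

# Build a feature list from a message: every single word, plus every ordered
# pair of words that appear within $window positions of each other.
# The "a <gap> b" pair format is a made-up convention for this sketch.
sub features {
    my ($text, $window) = @_;
    my @words    = map { lc($_) } ( $text =~ /(\w+)/g );
    my @features = @words;
    for my $i ( 0 .. $#words ) {
        for my $gap ( 1 .. $window ) {
            my $j = $i + $gap;
            last if $j > $#words;
            push @features, "$words[$i] <$gap> $words[$j]";
        }
    }
    return @features;
}

my @f = features( 'Buy cheap pills now', 2 );
print scalar(@f), " features from a four-word message\n";
```

Even with a window of 2, the feature count already outgrows the word count, which is the "N much bigger than X" property described above.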

  • John Graham-Cumming - POPfile

    Most of his very entertaining talk was about the ingenious tricks that spammers resort to to obfuscate spam against filters, including most diabolically one example that placed each column of monospace text in the message into an HTML column, so that the average HTML-capable mail client would render the message properly, but it would be absolute gibberish to most mail filters. The ultimate lesson was that any good filter has to focus not on "ascii-space" (the literal bytes as transmitted) but the "eye space" (the rendered text as seen by the user), which by extension may mean that any full scale spam parser/filter could also have to include a full-scale HTML & Javascript engine. Yikes!

    As for Graham-Cumming's software, it's a Perl application, available for all platforms (Windows, Mac, & of course Linux) that allows users to filter POP3 mail. Interesting stuff if you're a POP user.
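To give a feel for the "ascii-space" versus "eye space" distinction, here is a deliberately crude sketch that strips HTML comments and tags before any filtering happens. It is regex-based and entirely my own illustration (not how POPfile does it); the function name eye_space is made up.

```perl
use strict;
use warnings;

# Crude approximation of "eye space": drop HTML comments and tags so that
# words split up by invisible markup are rejoined before filtering.
sub eye_space {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;    # comments never render
    $html =~ s/<[^>]*>//g;        # the tags themselves never render
    $html =~ s/\s+/ /g;           # collapse runs of whitespace
    $html =~ s/^ | $//g;          # trim the ends
    return $html;
}

print eye_space('<p>Low mort<!-- zq -->gage <b>ra</b>tes</p>'), "\n";
```

This handles markup hidden inside words, but it would do nothing against the table-column trick described above, which is exactly why a serious filter might need something much closer to a real rendering engine.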

  • John Draper - ShopIP

    Most of Draper's work seemed to be focused on profiling spammers, as opposed to profiling spam itself, by throwing out a series of honeypot addresses & using data collected to hunt down spammers.

  • Paul Judge, CipherTrust

    Judge's big argument, which no one really disagrees with, is that spam has become not just a nuisance, but an actual information security issue. To that end, he is advocating much more collaborative effort to address the problem than we have seen to date: conferences like this, mailing list discussions, better tools, and public data repositories of known spam [and ham]. To that last point, one of his observations (which others made as well) was that there are no universally agreed on standards for what qualifies as spam, so repositories for spam will not be accurate for all users (spam for your programmers will be the bread & butter of your marketing department, etc). Plus, there are obvious privacy issues in publishing your spam & ham for public scrutiny. And to add another wrinkle, one danger of public spam/ham databases is that spammers can poison them with false data, screwing things up for everyone. That said, he encouraged users to help out with building

  • Paul Graham

    The man who organized the conference and kicked everything this week off with his landmark paper from last fall, A Plan for Spam. Graham's spam filtering technique famously makes use of Bayesian statistics, a technique popular with nearly all of the speakers. The nice thing about a statistical approach, as opposed to heuristics, simple phrase matching, RBLs, etc, is that they can be very robust & accurate; the down sides are that they have to be trained against a sufficiently large "corpus" of spam (most techniques have this property though) and they have to be continually retrained over time (again, this is common). Graham was too modest to produce numbers, but subjectively his results seemed to be even better than what Yerazunis gets with MailFilter, by an order of magnitude or more.

    Like other speakers, he predicted that spammers are going to make their messages appear more & more like "normal" mail, so we're always going to have to be persistent about this -- as one example, he showed us an email he received IN ALL CAPS from a non-English speaker asking for programming help, and although it was legit, the filters insisted otherwise. "That message is the one that keeps me up at night."

    Everyone interested in the spam issue should go read Graham's paper immediately.
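For anyone who wants the flavor of the statistical approach before reading the paper, here is a simplified sketch of the naive Bayes style combination it describes: estimate a spam probability per token from corpus counts, then combine the per-token probabilities into one score. The function names are mine, the 0.5 default for unseen tokens is an assumption of this sketch, and refinements from the paper (such as only using the most interesting tokens) are left out.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Per-token spam probability from counts in a spam corpus and a ham corpus.
# Clamped to [0.01, 0.99] so no single token is ever treated as certain.
sub token_probability {
    my ($spam_count, $ham_count, $n_spam, $n_ham) = @_;
    my $s = min( 1, $spam_count / $n_spam );
    my $h = min( 1, $ham_count / $n_ham );
    return 0.5 if $s + $h == 0;    # never seen: no evidence (this sketch's choice)
    return max( 0.01, min( 0.99, $s / ( $s + $h ) ) );
}

# Combine per-token probabilities, treating the tokens as independent.
sub combined_probability {
    my @p = @_;
    my ( $spam, $ham ) = ( 1, 1 );
    for my $p (@p) {
        $spam *= $p;
        $ham  *= 1 - $p;
    }
    return $spam / ( $spam + $ham );
}
```

A message would then be flagged when the combined score passes some threshold; where to set that threshold, and how to keep the counts current, is the retraining burden mentioned above.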

  • Robert Rothe, eXpurgate

    Rothe works for Eleven, an ASP company from Berlin selling a spam management service/application called eXpurgate. His talk was short on details about how the tool worked (mainly that it searches for bulk mail), focusing instead on the high level functionality it provides to users -- basically, they classify mail as safe, questionable, or dangerous, and let the users handle them accordingly. Another speaker that sees spam as a network security issue, so they built their system accordingly, with privacy of the client's mail content in mind etc.

    Like many speakers, he warned about the dangers of an anti-spam "monoculture": that Bayesian techniques might be great, but if that's all anyone uses then spammers will catch on and adjust their messages to look more like normal mail, to the point that Bayesian filters won't work anymore. As a result, we're going to need to attack the problem from several angles, using different techniques, to keep the spammers off balance as much as possible.

  • Matt Sergeant, SpamAssassin

    SA is a well known Perl application for heuristically profiling messages as spam, adding headers to the message saying for example "I am 72% sure this is spam because it has X Y Z", and passing off the message to procmail or whatever to be handled accordingly. SpamAssassin can handle a message throughput great enough that it can be deployed at the network level (whereas some of the others, which might have somewhat better hit rates, are still too inefficient at this point). Deployed this way, the differences in effectiveness for single vs. multiple users become very apparent, as 99% effective rates fall down into the 95-80% range. This happens because, again, different users define different things as spam, so mapping one fingerprint to all users can never work quite right. For an example of a tool that your company can deploy right now & get fast, decent results, SA looks like a good choice; but for the long run it looks like a Bayesian technique is going to get better performance, and SA is adding a statistical component to its toolkit. Good talk.
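The general shape of heuristic scoring can be sketched in a few lines: a list of weighted tests, a running total, and the names of the tests that fired. To be clear, this is not SpamAssassin's code or API; the rule names, weights, and patterns below are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical rules: a name, a weight, and a test against the raw message.
my @rules = (
    [ 'ALL_CAPS_SUBJECT',    1.5, sub { $_[0] =~ /^Subject: [^a-z\n]*[A-Z][^a-z\n]*$/m } ],
    [ 'MENTIONS_FREE_OFFER', 1.0, sub { $_[0] =~ /\bfree offer\b/i } ],
    [ 'MANY_EXCLAMATIONS',   0.5, sub { $_[0] =~ /!{3,}/ } ],
);

# Returns the total score followed by the names of the rules that matched.
sub score_message {
    my ($message) = @_;
    my ( $total, @hits ) = (0);
    for my $rule (@rules) {
        my ( $name, $weight, $test ) = @$rule;
        next unless $test->($message);
        $total += $weight;
        push @hits, $name;
    }
    return ( $total, @hits );
}
```

The caller compares the total against a threshold and can write the list of hits into a header, which is the "because it has X Y Z" part. It also shows why one shared rule set fits a group of users less well than an individual: the weights encode one person's idea of spam.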

  • Barry Warsaw, Python Labs

    This was another example of the "monocultures are dangerous" philosophy, as Warsaw explained how he is helping to use a variety of anti-spam techniques -- from clever Exim MTA configuration to good use of Spam Assassin & Procmail to fine tuning of the MailMan mailing list engine -- to work together to manage the spam problem for all things Python (, Zope, many mailing lists, a few employees, etc).

    He pointed out that some very simple filters can be surprisingly effective: run a sanity check on the message's date; look for obviously forged headers; make sure the recipients are legit; scan for missing Message-Id headers; etc. In response to the person that originally posted the article, yes, he did mention blocking outgoing SMTP as an effective element of a many tiered spam management approach.

    Among other tricks for getting the different filtering tiers to play nice together, they make heavy use of the X-Warning header so that if an alarm goes off in one tier of their mail architecture, other components can respond appropriately. Cited projects included ElSpy and SpamBayes.
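A couple of the simple checks Warsaw listed are easy to sketch. The following is my own minimal illustration, not anything from the Python Labs setup: it only tests that Date and Message-Id headers are present (a real date sanity check would parse the value and compare it to the clock) and that at least one To: address is local. The function names and the set of local addresses are assumptions.

```perl
use strict;
use warnings;

# Parse the header block of a raw message into a hash keyed by lowercased name.
sub parse_headers {
    my ($raw) = @_;
    my ($head) = split /\r?\n\r?\n/, $raw, 2;
    $head =~ s/\r?\n[ \t]+/ /g;    # unfold continuation lines
    my %h;
    for my $line ( split /\r?\n/, $head ) {
        my ( $name, $value ) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $h{ lc $name } = $value;
    }
    return \%h;
}

# Return a list of reasons the message looks suspicious (empty list = passes).
# $local is a hypothetical set of addresses this server accepts mail for.
sub sanity_problems {
    my ( $raw, $local ) = @_;
    my $h = parse_headers($raw);
    my @problems;
    push @problems, 'missing Message-Id' unless $h->{'message-id'};
    push @problems, 'missing Date'       unless $h->{'date'};
    my @rcpts = map { lc($_) } ( ( $h->{to} || '' ) =~ /([\w.+-]+\@[\w.-]+)/g );
    push @problems, 'no local recipient' unless grep { $local->{$_} } @rcpts;
    return @problems;
}
```

Checks like these are cheap enough to run before any heavier filter, and each reason string could be passed along to later tiers in a header.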

  • Barry Shein, founder & CEO of The World -- or as he laughingly put it, "President of the World". Har har har

    This talk was mostly a let down for me -- Shein has made his views very well known, and his ranting, rambling talk didn't really introduce any new ideas for anyone that had read that interview (some good jokes & quotes though).

    His core argument is that spam is "the rise of organized crime on the internet", that filters are nice but that the mail architecture itself is fundamentally flawed, and that ISPs like his -- in 1989, The World was the world's first dialup ISP -- are being killed by the problem. Shein was very annoyed that all these talented people are having to clean up a mess like this when we should be out working on more interesting stuff, and not having to worry about this issue. His big hope seemed to be that legislation will someday come to the rescue, but he sounded very pessimistic. (Others in the room seemed to feel that this was a very interesting machine learning problem, and weren't really fazed by his pessimism -- but then most of the people in the room don't run ISPs.)

    He also suggested that we need to find a way to make spammers pay for the bandwidth they are consuming (rather than having users & ISPs shoulder the burden) but didn't seem to know how we might go about implementing this. At all.

    Fun rant to cheer along to, but for me it wasn't very constructive in the end.

  • Jean-David Ruvini, eLabs SmartLook

    This was an interesting product. Ruvini's company is developing an extension to Outlook 2000 & XP that will watch the way users categorize messages into folders, come up with a profile for what kinds of messages end up in which folders, and then try to offer similar categorization on an automatic basis. Think of it as Procmail for Outlook, without having to mess with (or even be aware of!) all the nasty recipes.

    Obviously if you have a spam folder, then spam will be one of the categories it looks for, but more broadly it will try to categorize all your mail as you would ordinarily categorize it. This makes SmartLook a broader tool than "just" a spam manager.

    SmartLook is another statistical filter, though it uses non-Bayesian algorithms to get results. eLabs' tests suggest that the product is able to properly categorize messages about 96% of the time, with no false positives, and (for their tests, mind you) that it performed better than Bayes filters over three months of usage.

    One nice property of this tool was that it works well with different [human] languages -- some strategies fall apart &/or need retraining when you switch from English to some other language. For certain markets (eLabs seems to be a European company, perhaps French?) this is a crucial feature, and having a tool that works with one of the biggest mail clients out there (most people don't use Mutt or Pine, sadly enough) can be very valuable. Very clever -- watch for the inevitable embrace & extend three years from now.

  • Eric Raymond

    He didn't say anything about guns, but he did try to correct one of the other speakers for misusing the term "hacker."

    Like Graham, ESR is a Lisp fan, but he knows that the vast majority of people aren't, and he also knows that the vast majority of people need to be using something like Graham's spam software. So on a lark, he came up with a clean version in C, named it BogoFilter, and put it on Sourceforge, where a community sprang up to, well, embrace & extend it.

    As good as Graham's Bayesian algorithm is, ESR felt -- as did many of the other speakers -- that the nature of your spam/ham corpus is much more significant than the relative difference among any handful of reasonably good algorithms. (Back to the often repeated point about how corpus effectiveness falls apart when used for a group of users, as opposed to individuals.) To that end, he strongly feels that the best way to deal with the spam problem is to get good tools into the hands of as many people as possible, and to make them as easy to use as possible (ahh, the old "open source UIs always suck" argument :). As an example, one of the first things he did was to patch the Mutt mail agent so that it had two delete keys: one for general deletion, one for "get rid of this because it's spam." That second key, and interface touches like it, seem like the way to get average people to start using filters on a regular basis.

  • Joshua Goodman, Microsoft Research

    Unlike ESR, Goodman felt that algorithm selection does make a big difference, but this being Microsoft he refused to disclose what algorithms his team is working with -- except to say that, when delivered, they will be more accessible for average users than SpamAssassin, Procmail recipes, or Mutt :)

    Microsoft has been working on the spam problem since 1997, but because of how big they are they've had unique problems in bringing solutions to market. As a case in point, they tried to introduce spam filters to a 1999 Outlook Express release, but were immediately sued by email greeting card company Blue Mountain because their messages were being inaccurately categorized as spam. With that in mind, they have been very reluctant to bring new anti-spam software out since then because they would like to see legislation protecting "good faith spam prevention efforts."

    As a very large player, Microsoft faced certain difficulties in developing useful filters -- it may make sense for you as an individual to filter all mail from Korea, but this doesn't work so well if you are trying to attract customers *from* Korea :). This has forced them to put a lot of work into thoroughly testing different strategies before offering them to the public.

    In spite of what millions of webmail users may have expected, Hotmail & MSN are currently being filtered by Brightmail's service, and plans are underway to reintroduce spam management features to client-side software. (Just imagine how bad it would be if they weren't paying someone to filter for them! Unfortunately, no hecklers piped up to ask if they are really selling Hotmail's user database to spammers, and if that is a source of annoyance for Goodman's team.)

    An interesting barrier his group has had to grapple with was what he called the "Chinese menu" or "madlibs" spam generation strategy: that it's easy to come up with a template for spam -- "[a very special offer] [to make your penis bigger] [and please your special lady friend all night!]" vs. "[an exclusive deal] [for genital enlargement] [that will boost your sex life!]" etc -- and have a small handful of options for each 'bucket' multiplying into a huge variety of individual messages that are easy for a human to group together but almost impossible for software to identify.
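
The multiplication in the "madlibs" strategy is easy to see in a few lines. This is only an illustrative sketch: the bucket contents are made up, and `count_variants` and `generate` are hypothetical helper names, not anything Goodman showed.

```python
from itertools import product
from math import prod

# Made-up template: three buckets with three interchangeable phrases each.
buckets = [
    ["a very special offer", "an exclusive deal", "a limited-time bargain"],
    ["to refinance your mortgage", "on a lower home loan rate",
     "for cheaper monthly payments"],
    ["that will save you thousands!", "with no hidden fees!",
     "approved in minutes!"],
]


def count_variants(buckets):
    """Number of distinct messages is the product of the bucket sizes."""
    return prod(len(options) for options in buckets)


def generate(buckets):
    """Yield every message the template can produce."""
    for choice in product(*buckets):
        yield " ".join(choice)


variants = list(generate(buckets))
print(count_variants(buckets))                 # 27
print(count_variants([list(range(10))] * 6))   # 1000000
```

Three options in three buckets already give 27 messages with no exact duplicates, and ten options in six buckets give a million, which is why matching on whole-message fingerprints gets nowhere against this kind of generator.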

  • Michael Salib, extremely funny MIT student

    Unlike nearly all other filter writers of the day, Salib's approach was heuristic: find a handful of reasonable spam discriminators, throw them all against his mail, and see how much he can identify that way. "It's sketchy, but this is a class project. I don't have to be realistic. [...] These results may be completely wrong."

    Much to his surprise, he's trapping a lot of spam. He pulls in a little bit of RBL data ("the first two or three links from Google, whatever"), looks for some patterns and so on, and then churns it through LMMSE (linear minimum mean square error estimation), an electrical engineering technique that as far as he can tell doesn't seem to be known in other fields. (Basically this involves running the messages through a series of scary-but-fast-to-calculate linear equations.) It turns out that he can process this much faster than a Bayes filter, to the point that customizing his approach for each user in a network would actually be feasible.

    For a small spam corpus, he got results better than SpamAssassin did, though for a large corpus his results were worse; he couldn't really account for why this would be the case, or predict how things would scale as the corpus continued to grow.

    When a member of the audience [who was apparently familiar to Salib -- I don't know who it was] questioned the RBL tactic and asked whether authenticating remote users might be the answer, Salib's response was "yes, I agree, but then you *do* work for Verisign, who is in the verification business, so you would say that."

    Right on, Salib -- his talk was easily the funniest & breeziest of the day :)
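
For the curious, the "series of linear equations" behind Salib's approach can be sketched roughly as follows. This is not his code or his feature set: it is a plain least-squares linear fit (which is what an LMMSE estimator reduces to when the means and covariances are estimated from the training sample), and the three yes/no features, the small `ridge` term, and all function names are assumptions made for illustration.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear(features, labels, ridge=1e-3):
    """Least-squares weights; the tiny ridge term keeps the system solvable."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the bias term
    n = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    target = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(n)]
    return solve(gram, target)


def score(weights, feature_vector):
    """Linear score; above 0.5 is treated as spam in this sketch."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_vector))


# Hypothetical features: [sender on a blocklist, ALL-CAPS subject, says "free"]
training = [
    ([1, 1, 1], 1), ([1, 0, 1], 1), ([0, 1, 1], 1), ([1, 1, 0], 1),
    ([0, 0, 0], 0), ([0, 0, 1], 0), ([0, 1, 0], 0), ([0, 0, 0], 0),
]
weights = fit_linear([f for f, _ in training], [y for _, y in training])
```

Once the weights are fitted, scoring a message is one multiply-and-add per feature, which fits Salib's claim that this is cheap enough to customize per user.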

  • David Lewis, general researcher

    The core of Lewis' argument, as ESR said earlier in the day, is that for any machine learning technique the quality of the learning corpus is much more important than the algorithm used. Bayes is one such algorithm, but there are many other good ones in the literature. In a dig at Goodman's refusal to disclose algorithms, Lewis pointed out that all of this has been publicly discussed since the first machine learning paper was published in 1961.

    Observations: "lots of task inspecific stuff works badly, but task specific stuff helps a lot." It is important to use different corpora for training and for general use, so that you don't train your machine to focus too much on certain types of input (this is a point that Microsoft's Goodman made as well).

    As Graham did, Lewis emphasized that spam is going to slowly start looking more like natural text, and we're going to have to deal with this as time goes on.
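
Lewis' point about keeping the training corpus separate from the mail you evaluate on can be shown with a tiny holdout split. This is a sketch under obvious assumptions: the corpus is synthetic, and `split_corpus`, `accuracy`, and the deliberately silly `memorizer` are hypothetical names invented for the example.

```python
import random


def split_corpus(messages, holdout_fraction=0.25, seed=0):
    """Shuffle a labeled corpus and split it into (training, held_out)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(classify, labeled_messages):
    """Fraction of (text, label) pairs that classify() gets right."""
    correct = sum(1 for text, label in labeled_messages
                  if classify(text) == label)
    return correct / len(labeled_messages)


# Synthetic corpus: odd-numbered messages are spam, even ones are ham.
corpus = [
    (f"message {i} " + ("buy now" if i % 2 else "see you at lunch"),
     "spam" if i % 2 else "ham")
    for i in range(20)
]
train, held_out = split_corpus(corpus)

# A "classifier" that has simply memorized its training mail.
memorized = dict(train)


def memorizer(text):
    return memorized.get(text, "ham")


print(accuracy(memorizer, train))     # 1.0 -- looks perfect on its own corpus
print(accuracy(memorizer, held_out))  # only the held-out number means anything
```

The memorizer scores perfectly on the mail it was trained on and catches no held-out spam at all, which is the failure mode you cannot see unless the two corpora are kept apart.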

  • Jon Praed, Internet Law Group

    To a burst of tremendous applause, this talk began with the sentence "my name is Jon Praed, and I sue spammers."

    He brought a legal take on the "not everything is spam to everybody" angle, emphasizing that we need a precise definition of what qualifies as Unsolicited Commercial Email (UCE). In particular, it has been difficult trying to pin down if the mail was really unsolicited, as this is where the spammers have the most wiggle room. However, if you can track down the spammer, they have to date rarely been able to verify that the user asked for mail, and so Praed has been able to successfully prosecute several spammers on this angle. He doesn't expect this to work forever though.

    According to Praed, "laws against spam exist in every state, and more are pending", but he doubts that a legal solution will ever be completely effective as long as spam is lucrative. By analogy, he pointed out that people still rob banks and that has never been legal.

    Praed informed the audience that there are several ways to get back at spammers, including injunctions, bankruptcy, and contempt. He pointed out that, to be blunt, a lot of these people are desperate low-lifes, and spam has been their biggest success in life. After these legal responses, their lives all get much worse. It hadn't occurred to me to see spammers as pitiful before, but I can now. Most importantly, Praed stressed that these legal remedies can be very effective, and he strongly warned against taking vigilante action, which is almost always worse than the spam itself and only serves to get you in even deeper trouble than the spammer.

    As for the sources of spam, most of it comes from offshore spam houses, abuse of free mail accounts (Hotmail & Yahoo, free signups at ISPs, etc) and bulk software (which may apparently soon become illegal in certain areas, provided that a law can be found to ban spam software while allowing things like MailMan or MajorDomo). Interestingly, he questioned the idea that header spoofing is a big problem, and claimed that in every case he has dealt with he has been able to track down the messages to a legit source sooner or later.

    Suggestion: if you get a spam citing a trademarked product [e.g. Viagra], forward it to the trademark holder and they will almost always follow up on it. Suggestion: be fast in trying to track down spammers, as some of them have gotten in the habit of leaving sites up long enough for mail recipients to visit, but taking them down before investigators get a chance to take a look. Legal observation: spam is almost always fraud, and can be prosecuted accordingly.

    Praed wrapped up his talk by citing the encouraging precedent that the famous Verizon Online vs. Ralsky case set: [a] that the court is interested in where the harm occurs, not where the person doing harm was when causing it (so if you send spam to someone in Alaska and spam is a capital offence in Alaska, you can be tried as a citizen of that state even if you caused the harm from somewhere else), and [b] that you are assumed to be familiar with a remote ISP's acceptable use policies, and ignorance is no defence (just as you can't say "I didn't know it was illegal to shoot someone", Ralsky couldn't say that he didn't know Verizon prohibits spam -- he had to have known that the AUP wouldn't allow what he was doing, so he deliberately didn't read it). That precedent makes the prospects for future prosecution of spammers much more encouraging. While, again, legal solutions may never eliminate the spam problem, a precedent like this can be an important supplement to filtering efforts (the stick to the filter's carrot, or something -- my lousy analogy, not Praed's).

  • David Berlind, ZDNet executive editor

    His talk was primarily about how he receives a huge quantity of email from ZDNet readers, and he can't afford to use any spam filtering strategy that would allow *any* false positives. As one of the speakers said -- sorry, I forget who (Microsoft's Goodman?) -- getting a 0% false positive rate is easy: just classify nothing as spam. Getting a 100% hit rate is also easy: just classify everything as spam. Any solution besides those two is always going to have some degree of error either way, and determining how much of each kind of error you want to accept is up to you. Most users will tolerate a moderate false negative rate (some spam gets through) if it means that the false positive rate (legit mail is deleted) is very low. In Berlind's case, the false positive rate has to be vanishingly small, because reading all customer mail is a critical sign of respect for him.

    Further, his business is also a legitimate mass emailer, sending out millions of free newsletters to users every day, and if Shein's proposal to bill bulk mailers were to catch on then even a very low rate would quickly put his company in the red. One obvious solution, which wasn't mentioned: start charging a subscription for these mailings, and make them profitable. I don't want to see this happen but if it did then the economics would tilt back toward making things feasible again.

    Berlind is appreciative of the anti-spam work that is being done, but at the same time is skeptical of how pragmatic most of what is being proposed can really be. He feels we need a massive effort to rework the way mail is handled [Y2K anyone? It could get IT people back to work...], and to that end hopes ZDNet can help promote such a cooperative effort between the parties working on this. They don't want to be involved -- they are journalists & publishers, not standards developers -- but they are eager to get things going & want to cover the story as it progresses.

    Like Shein said, he feels it's a waste for all these talented people to be working on combating penis enlargement offers, and hopes that we can find a way to get past this and work on real problems, "like world peace." This comment got a chuckle from the audience, but he seemed like the kind of guy that really meant that, and more importantly, he was right. A smart guy like Paul Graham or Bill Yerazunis shouldn't have to waste time tinkering with how many Viagra offers he can automagically delete when there are more fun things to be doing.
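
The "0% false positives is easy, 100% hit rate is easy" quip from earlier in Berlind's section is worth a quick worked example. The mailbox below is made up, `error_rates` is a hypothetical helper, and the sketch assumes the mailbox contains at least one spam and one ham message so neither rate divides by zero.

```python
def error_rates(predictions, truths):
    """Return (false_positive_rate, hit_rate); True means 'spam'.

    Assumes truths contains at least one spam and one ham message.
    """
    ham_total = sum(1 for t in truths if not t)
    spam_total = sum(1 for t in truths if t)
    false_positives = sum(1 for p, t in zip(predictions, truths) if p and not t)
    hits = sum(1 for p, t in zip(predictions, truths) if p and t)
    return false_positives / ham_total, hits / spam_total


# Made-up mailbox: four spam messages followed by six legitimate ones.
truths = [True] * 4 + [False] * 6

nothing_is_spam = [False] * len(truths)
everything_is_spam = [True] * len(truths)

print(error_rates(nothing_is_spam, truths))     # (0.0, 0.0)
print(error_rates(everything_is_spam, truths))  # (1.0, 1.0)
```

Neither trivial filter is any use: the first never loses mail but never catches spam, and the second catches everything by throwing away all the real mail too. Every real filter sits somewhere between those two corners, and Berlind needs one pinned as close as possible to a 0.0 false positive rate.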

  • Ken Schneider, Brightmail

    As mentioned earlier, Brightmail provides an ASP service for real time filtering of both incoming & outgoing mail. As would perhaps be expected, bigger ISPs and networks attract larger amounts of spam: 50% of mail coming into big ISPs and 40% coming into big companies is now spam. Brightmail offers the Probe Network, a <slashdot-killfile-term>patented</slashdot-killfile-term> system of decoy honeypot addresses that gather data for analysis at their logistics center, which in turn distributes spam filtering rules to their clients where a plugin for $MTA (using the open source or proprietary MTA of the client's choice) can act on the database.

    An interesting property of their system is that they have a mechanism for both aging out dormant rules as well as for reactivating retired ones, so that the currently active ruleset can be kept as lean & efficient as possible. A big source of difficulty for them is legitimate commercial opt-in lists, because things have gotten more shady & blurry over time and it's now hard to tell this mail from much of the spam out there. Whitelists help here, but the problem is still difficult.
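
Brightmail didn't describe how the aging mechanism actually works, so the following is only a guess at the general shape: a rule that hasn't matched anything for a while gets retired, and a retired rule that starts matching probe mail again gets revived. The class name, the 30-day window, the integer day counter, and the rule strings are all assumptions.

```python
class RuleSet:
    """Toy active/retired rule bookkeeping; not Brightmail's real design."""

    def __init__(self, max_idle_days=30):
        self.max_idle_days = max_idle_days
        self.active = {}    # rule -> day number of its most recent match
        self.retired = {}   # rule -> day number of its last match before retiring

    def record_hit(self, rule, today):
        """Note that `rule` matched probe mail today, reviving it if retired."""
        self.retired.pop(rule, None)
        self.active[rule] = today

    def age_out(self, today):
        """Retire every active rule that has been idle for too long."""
        for rule, last_hit in list(self.active.items()):
            if today - last_hit > self.max_idle_days:
                self.retired[rule] = self.active.pop(rule)


rules = RuleSet(max_idle_days=30)
rules.record_hit("subject:mortgage", today=0)
rules.record_hit("body:herbal", today=20)

rules.age_out(today=40)                         # mortgage rule idle 40 days
retired_after_aging = sorted(rules.retired)     # ['subject:mortgage']

rules.record_hit("subject:mortgage", today=45)  # the campaign comes back
active_after_revival = sorted(rules.active)     # both rules active again
```

Keeping retired rules around rather than deleting them is what makes reactivation cheap: the rule only has to move back into the active set, not be rediscovered from scratch.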

After each speaker had his turn, there was a panel discussion, but not much really happened there, and the moderator cut things short after only a couple of minutes. The original plan was for everyone to go out for Chinese food afterwards and continue the discussions over dinner, but when 580 people signed up that plan obviously fell apart. :) And so, here end the notes...