Wanting to get caught up on various perl.org-based mailing lists, I thought I'd just download the archives in mbox format and view them in my local mail reader.
Not so fast.
Turns out the mailing lists are managed with ezmlm, which stores the messages in a format other than mbox.
So I asked Ask what it would take to make mbox downloads a reality. He told me, and I volunteered to do it. So I was up until about 2:30 last night banging out a script which converts the ezmlm archive format to mbox.
I thought about trying to use the tools that come with ezmlm, or maybe the Mail::Ezmlm wrapper module... But it turns out that this conversion is too simple to require that approach. Basically, every message is stored in its own file, and the files are shoved down in a directory structure that seems to have no more purpose than to keep any given directory from having too many files in it. Also, each directory contains an index file.
So, essentially, converting that to mbox format takes nothing more than concatenating all the files of interest into one file.
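To make the "just concatenate the files" idea concrete, here is a rough sketch in Python (my script is Perl; this is only an illustration, not the script itself). It assumes the archive consists of numerically named subdirectories holding numerically named message files, and that anything else, such as the index file, can be skipped. The names message_files and concatenate are made up for this example.

```python
import os

def message_files(archive_dir):
    """Yield message file paths under archive_dir in numeric order.

    Assumption: subdirectories and message files have all-digit names.
    Anything else (for example the per-directory index file) is skipped.
    """
    subdirs = [d for d in os.listdir(archive_dir)
               if d.isdigit() and os.path.isdir(os.path.join(archive_dir, d))]
    for sub in sorted(subdirs, key=int):
        subpath = os.path.join(archive_dir, sub)
        names = [n for n in os.listdir(subpath) if n.isdigit()]
        for name in sorted(names, key=int):
            yield os.path.join(subpath, name)

def concatenate(archive_dir, out_path):
    """Append every message to out_path, separated by a blank line.

    Returns the number of messages written. No 'From ' line is added here,
    so the result is not yet a valid mbox.
    """
    count = 0
    with open(out_path, "wb") as out:
        for path in message_files(archive_dir):
            with open(path, "rb") as msg:
                data = msg.read()
            out.write(data)
            if not data.endswith(b"\n"):
                out.write(b"\n")  # make sure the message ends with a newline
            out.write(b"\n")      # blank line between messages
            count += 1
    return count
```

Sorting with key=int keeps message 10 after message 9, which a plain string sort would get wrong.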
But there are (of course) a few glitches.
First and foremost, the messages as stored do not have the initial 'From ' line. So, I process each file through formail -a Date:, which fabricates a 'From ' line, containing the info from the Date: header.
Then I process each file through a little loop of perl which parses the date (using the handy Date::Manip module) and reformats it into the format required for 'From ' lines.
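For the curious, building a 'From ' line directly from a message's headers might look something like the Python sketch below (again, not my Perl code and not what formail does internally). It assumes the sender can be taken from Return-Path or, failing that, From, and that the timestamp should be the Date: header converted to UTC in the usual asctime layout. The function name from_line is hypothetical.

```python
import time
from email import message_from_bytes
from email.utils import parseaddr, parsedate_to_datetime

def from_line(raw_message):
    """Build an mbox 'From ' separator line for one raw message (bytes).

    Assumptions: the sender comes from Return-Path or From, and the
    timestamp is the Date: header expressed in UTC. Raises an exception
    if the Date: header is missing or cannot be parsed.
    """
    msg = message_from_bytes(raw_message)
    header = msg.get("Return-Path") or msg.get("From", "")
    sender = parseaddr(header)[1] or "MAILER-DAEMON"
    when = parsedate_to_datetime(msg["Date"])
    # asctime yields e.g. 'Tue Mar  5 19:30:00 2002' (day is space-padded)
    stamp = time.asctime(when.utctimetuple())
    return "From {} {}\n".format(sender, stamp)
```

Writing the result of from_line in front of each message is what turns a plain concatenation into something a mail reader will accept as mbox.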
The second issue is filtering down the messages to just those of interest. In this case, I want to select only those whose date is in a given year/month timeframe (specified by the user).
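The year/month filter boils down to a check on the Date: header. A minimal Python sketch of that check follows (my script does this in Perl with Date::Manip; the function name in_month is invented for illustration). It assumes the date is compared as written in the header, without converting time zones, and that messages with a missing or unparseable date are simply left out.

```python
from email import message_from_bytes
from email.utils import parsedate_to_datetime

def in_month(raw_message, year, month):
    """Return True if the message's Date: header falls in year/month.

    Assumptions: the date is taken as written in the header (no time zone
    conversion), and a missing or unparseable date counts as no match.
    """
    msg = message_from_bytes(raw_message)
    try:
        when = parsedate_to_datetime(msg["Date"])
    except (TypeError, ValueError):
        return False
    return (when.year, when.month) == (year, month)
```

Something like in_month(raw, 2002, 3) would then select only the messages dated March 2002.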
Anyway, the script works pretty nicely. It's not particularly fast. In my test, it takes 13 seconds (user) to read in 1300 messages and process 200 messages through formail out to the mbox. The main overhead is probably spawning and reading from formail. Considering how little it actually does, I should probably replace it with some perl. Some of the remaining overhead probably comes from using Date::Manip to parse and reformat dates. But man, the flexibility! It can understand just about anything you throw at it. For example, to convert the current month's messages, you can specify "today" rather than an explicit year and month.
O.k., I replaced the formail bit with perl code to create the 'From ' line. That shaved about 10% off the time. But remember that this step is only done for the messages which match the interest criteria. In general, there will be far fewer matching messages than total messages in the archive.