Journal of jdporter (36)

Wednesday July 24, 2002
11:40 AM

perl lists in mbox

[ #6606 ]

Wanting to get caught up on various mailing lists, I thought I'd just download the archives in mbox format and view them in my local mail reader.

Not so fast.

Turns out the mailing lists are managed with ezmlm, which stores the messages in a format other than mbox.

So I asked Ask what it would take to bring that capability to reality. He told me, and I volunteered to do it. So I was up until about 2:30 last night banging out a script which converts from maildir format to mbox.

I thought about trying to use the tools that come with ezmlm, or maybe the Mail::Ezmlm wrapper module... But it turns out that this conversion is too simple to require that approach. Basically, every message is stored in its own file, and the files are shoved down in a directory structure that seems to have no more purpose than to keep any given directory from having too many files in it. Also, each directory contains an index file.

So, essentially, converting that to mbox format takes nothing more than concatenating all the files of interest into one file.
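As a rough illustration of that concatenation step, here is a minimal Python sketch. It is not the actual script, which was written in Perl and isn't reproduced here. It assumes the layout described above: numbered subdirectories holding one file per message, plus an index file in each. The all-digit directory and file names, the `index` file being the only non-numeric entry worth skipping, and the function names `message_paths` and `concatenate` are all assumptions made for the example.

```python
import os
import re

NUMERIC = re.compile(r"[0-9]+")  # assumed naming: directories and messages are all digits


def message_paths(archive_root):
    """Yield message file paths in numeric order, skipping anything non-numeric
    (such as the per-directory index file)."""
    subdirs = [d for d in os.listdir(archive_root)
               if NUMERIC.fullmatch(d)
               and os.path.isdir(os.path.join(archive_root, d))]
    for sub in sorted(subdirs, key=int):
        subdir = os.path.join(archive_root, sub)
        names = [n for n in os.listdir(subdir) if NUMERIC.fullmatch(n)]
        for name in sorted(names, key=int):
            yield os.path.join(subdir, name)


def concatenate(archive_root, out_path):
    """Append every message file to out_path; return how many were written."""
    count = 0
    with open(out_path, "wb") as out:
        for path in message_paths(archive_root):
            with open(path, "rb") as msg:
                data = msg.read()
            out.write(data)
            if not data.endswith(b"\n"):
                out.write(b"\n")  # keep messages from running together
            count += 1
    return count
```

On its own this only glues the files together; the result is not yet a valid mbox, for the reasons that follow.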

But there are (of course) a few glitches.

First and foremost, the messages as stored do not have the initial 'From ' line. So, I process each file through formail -a Date:, which fabricates a 'From ' line, containing the info from the Date: header.

Then I process each file through a little loop of Perl which parses the date (using the handy Date::Manip module) and reformats it into the format required for 'From ' lines.
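To make the 'From ' line step concrete, here is a hedged Python sketch of the same idea, using the standard `email` package in place of formail and Date::Manip. It is far less forgiving than Date::Manip: `parsedate_to_datetime` only understands RFC 2822 style dates and raises an error when the Date: header is missing or malformed. Taking the sender from Return-Path or From:, falling back to `MAILER-DAEMON`, converting to UTC, and the function names `from_line` and `to_mbox_entry` are assumptions of this example, not details of the original script.

```python
import time
from datetime import timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def from_line(raw_message):
    """Build an mbox 'From ' separator line from a message's own headers."""
    msg = message_from_string(raw_message)
    # Assumed sender choice: Return-Path, then From:, then a placeholder.
    _, sender = parseaddr(msg.get("Return-Path") or msg.get("From") or "")
    sender = sender or "MAILER-DAEMON"
    when = parsedate_to_datetime(msg["Date"])  # raises if Date: is absent or unparseable
    if when.tzinfo is not None:
        when = when.astimezone(timezone.utc)
    # asctime-style timestamp, e.g. "Wed Jul 24 15:40:00 2002"
    return "From %s %s" % (sender, time.asctime(when.timetuple()))


def to_mbox_entry(raw_message):
    """Prefix the separator line, quote body lines that start with 'From ',
    and end the entry with a blank line."""
    quoted = "".join(
        ">" + line if line.startswith("From ") else line
        for line in raw_message.splitlines(keepends=True))
    if not quoted.endswith("\n"):
        quoted += "\n"
    return from_line(raw_message) + "\n" + quoted + "\n"
```

The `>From ` quoting matters because mail readers treat any line beginning with `From ` as the start of a new message.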

The second issue is filtering down the messages to just those of interest. In this case, I want to select only those whose date is in a given year/month timeframe (specified by the user).
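A small Python sketch of that selection step, under stated assumptions: the year and month are read straight from each message's Date: header in the sender's own timezone, and messages with a missing or unparseable date are simply left out. The names `message_month` and `select_month` are made up for the example, and unlike Date::Manip this will not understand a word like "today".

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_month(raw_message):
    """Return (year, month) from the Date: header, or None if it can't be read."""
    date_header = message_from_string(raw_message)["Date"]
    if date_header is None:
        return None
    try:
        when = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):  # which one is raised depends on the Python version
        return None
    return (when.year, when.month)


def select_month(messages, year, month):
    """Keep only the messages dated within the given year and month."""
    return [m for m in messages if message_month(m) == (year, month)]
```

Doing this check before building the 'From ' line keeps the more expensive work limited to the messages that actually match.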

Anyway, the script works pretty nicely, though it's not particularly fast. In my test, it takes 13 seconds (user time) to read in 1300 messages and process 200 of them through formail out to the mbox. The main overhead is probably spawning formail and reading from it. Considering how little it actually does, I should probably replace it with some Perl. Some of the remaining cost probably comes from using Date::Manip to parse and reformat dates. But man, the flexibility! It can understand just about anything you throw at it. For example, to convert the current month's messages, you can specify "today" rather than an explicit year and month.

O.k., I replaced the formail bit with Perl code to create the 'From ' line. That shaved about 10% off the time. But remember that this step is only done for the messages which match the interest criteria. In general, the number of matching messages will be much smaller than the total number of messages in the archive.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • qmail-scanner, iirc, comes with a maildir2mbox script, and I think I've got one somewhere as well. Personally though, I prefer maildirs =) [That's religious territory there =)]
      ---ict / Spoon
    • Yeah... but this is perl. :-)

      I prefer maildirs

      Sure, maildirs are nice -- better than mboxes in most ways. But mutt uses mboxes, and I like mutt!

      • *cough*

          set mbox_type="Maildir"

        Mutt uses many things =) I also love mutt.

          ---ict / Spoon
        • :-)
          Somehow I knew that would turn out to be the case.

          You know mutt has a jillion features, it's hard to comprehend them all.

          O.k., so, lemme aks ya:

          How do I direct mutt to set the outgoing envelope "From" address to a specific value? I thought setting the EMAIL environment variable would do it, but apparently not so.

          • Damn. The one question I'd love to know the answer to is the one you ask.

            As far as I know, it's set by the MTA rather than any MUA. So it probably depends on your MTA and its capabilities. Definitely a useful thing to be able to change, just because of YahooGroups if nothing else.
              ---ict / Spoon
          • Found it (mentioned on a mutt list a few minutes ago).

            In your muttrc, add "set envelope_from=yes".

            Should do the trick. See the manual, naturally, if problems persist.
              ---ict / Spoon
    • ezmlm archives are not maildirs. (or mboxes).

          - ask

      -- ask bjoern hansen [], !try; do();

      • You are quite right. I was mistaken to call them maildirs in the first place.

        Still, except for the directory structure required by the maildir format, the ezmlm archives are very similar to maildir. That is, the message files themselves look exactly like maildir message files.
        • AIUI, if you change conversion code from using 'cur', 'new' and 'tmp' to just using any directories that match /^\d+$/ then the conversion works.

          As you say, they're similar enough to not really matter.

            ---ict / Spoon
  • This sounds far too complicated.

    The list messages may be stored in a maildir, but you don't need to rely on that.

    Ask has made the full archives of the mailing lists available through NNTP. Presumably you're going to be doing a bulk mbox conversion at some point, and might want to grab new messages later on. Grabbing each message over NNTP may be a drag, but it's damn easy. And adding a well-formed "From " line at the beginning of each message is just anoth