use Perl

Journal of Purdy (2383)

Tuesday August 15, 2006
09:14 AM


[ #30632 ]

I recently had a situation come up where I had to whip up some code to split up a huge (1 GB) mbox file. I KNOW I should be using Maildir, but c'mon, people ... mbox is what Debian does by default and I don't spend my time sysadmin'ing stuff. In looking around, I couldn't believe others hadn't already done this (perhaps they have and my Google-fu just wasn't adequate). There was a promising git-mailsplit program, but I couldn't find it in Debian.

So I whipped this up - feel free to use/tweak this for your own use:

#!/usr/bin/perl -wT

# Process:
# 1) cp /var/mail/person /var/mail/person.bak
# 2) Run this script
# 3) chmod/chown the INBOX.GigSplitNN files
#      chown person:users /home/person/INBOX.GigSplit*
#      chmod 0600 /home/person/INBOX.GigSplit*
# 4) mv /var/mail/person /var/mail/person.prerm
# 5) mail the person and see if the /var/mail/person gets setup right
# 6) diff /var/mail/person.bak and /var/mail/person.prerm and put that in /var/mail/person
#      i ended up just tailing the file with the right number of differing lines
#      and >>'ing that into /var/mail/person
#      b/c diff'ing two 1GB files takes WAY too long!

use strict;

open( MBOX, '/var/mail/person.bak' ) || die "Cannot open person.bak: $!";

# go through the mbox file
my $message = '';
my $line_count = 0;
my $message_count = 0;
my $file_base = '/home/person/INBOX.GigSplit';
my $file_i = 1;
my $line_count_limit = 580000; # this ends up with ~40MB files, which are more tolerable
my $need_to_write_init = 1;

while( <MBOX> ) {
        $line_count++;
        if ( /^From / ) {
                # a new message starts here, so write out the one collected so far
                if ( length( $message ) > 0 ) {
                        my $file = $file_base . sprintf( "%02d", $file_i );
                        $message_count++;
                        print "Got message # $message_count - appending to $file ...\n";
                        if ( $need_to_write_init ) {
                                write_initial_msg( $file );
                                $need_to_write_init = 0;
                        }
                        open( SPLIT, ">>$file" ) || die "Cannot append to $file: $!";
                        print SPLIT $message;
                        close( SPLIT );
                        # only move on to the next file between messages
                        if ( $line_count > $line_count_limit ) {
                                print "Line Count exceeded $line_count_limit, so incrementing \$file_i...\n";
                                $line_count = 0;
                                $file_i++;
                                $need_to_write_init = 1;
                        }
                }
                $message = $_;
        } else {
                $message .= $_;
        }
}

close( MBOX );

print "All done!\n";

# start a new split file with the placeholder message the IMAP server expects
sub write_initial_msg {
        my $file = shift;
        open( FILE, ">$file" ) || die "Cannot open $file to put in initial msg: $!";
        print FILE <<"_EOF_";
From MAILER-DAEMON Mon Aug 14 13:00:31 2006
Date: 14 Aug 2006 13:00:31 -0400
From: Mail System Internal Data <MAILER-DAEMON>
Message-ID: <1155574831>
X-IMAP: 1134739889 0000025473
Status: RO

This text is part of the internal format of your mail folder, and is not
a real message. It is created automatically by the mail system software.
If deleted, important folder data will be lost, and it will be re-created
with the data reset to initial values.

_EOF_
        close( FILE );
}

So that will create INBOX.GigSplit01 ... INBOX.GigSplitNN, which my user could manage with Squirrelmail (I had to hack /home/person/.mailboxlist to add those new folders). Since the problem stemmed from a checkbox in her email client keeping old messages on the server and not removing them, she could simply delete a lot of the stuff as redundant and just look at the more recent messages for stuff she missed. Remotely accessing a 1GB mbox file tends to time out. ;)
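
Step 6 in the script's header comments (carrying over whatever mail arrived after the backup was taken) was done by hand with tail. A minimal Perl sketch of the same idea follows. It assumes person.prerm is simply person.bak with new mail appended, which should hold if nothing but delivery touched the file in between. The name append_new_mail is made up for this example and is not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy everything in $grown that lies beyond the size
# of $backup onto the end of $target, and return the number of bytes copied.
# Assumes $grown is $backup plus appended mail (nothing rewritten in between).
sub append_new_mail {
    my ( $backup, $grown, $target ) = @_;

    my $offset = ( stat $backup )[7];
    defined $offset or die "Cannot stat $backup: $!";

    open( my $in, '<', $grown ) or die "Cannot open $grown: $!";
    binmode $in;
    seek( $in, $offset, 0 ) or die "Cannot seek in $grown: $!";

    open( my $out, '>>', $target ) or die "Cannot append to $target: $!";
    binmode $out;

    my $copied = 0;
    while ( my $read = read( $in, my $buffer, 65536 ) ) {
        print {$out} $buffer;
        $copied += $read;
    }

    close( $out ) or die "Cannot close $target: $!";
    close( $in );
    return $copied;
}

# Example call, using the paths from the process comments:
# append_new_mail( '/var/mail/person.bak', '/var/mail/person.prerm', '/var/mail/person' );
```

Seeking past the backup's size avoids comparing the two 1GB files at all, which is the slow part of the diff approach.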

Yes, I KNOW that could be optimized and probably even done in one line (go for it, golfers!) ... it was something I had to do and it wasn't too painful to run (6700 msgs in 2 minutes).
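
One possible optimization, as a rough sketch rather than a drop-in replacement: keep the current output file open and write each line as it is read, instead of buffering a whole message and reopening the split file for every append. The function name split_mbox and its parameters are invented for this example. It also skips the X-IMAP placeholder message that write_initial_msg puts at the top of each file, so that part would still need adding for Squirrelmail.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical single-pass splitter: writes "${file_base}01", "${file_base}02", ...
# and starts a new output file at the first message boundary after
# $line_limit lines. Returns the number of messages seen.
sub split_mbox {
    my ( $mbox_path, $file_base, $line_limit ) = @_;

    open( my $mbox, '<', $mbox_path ) or die "Cannot open $mbox_path: $!";

    my ( $out, $file_i, $lines, $messages ) = ( undef, 0, 0, 0 );
    while ( my $line = <$mbox> ) {
        if ( $line =~ /^From / ) {
            $messages++;
            # only switch files on a message boundary
            if ( !$out || $lines > $line_limit ) {
                close( $out ) if $out;
                my $file = sprintf( '%s%02d', $file_base, ++$file_i );
                open( $out, '>', $file ) or die "Cannot write $file: $!";
                $lines = 0;
            }
        }
        next unless $out;    # ignore anything before the first "From " line
        print {$out} $line;
        $lines++;
    }

    close( $out ) if $out;
    close( $mbox );
    return $messages;
}

# Example call with the values used in the script above:
# split_mbox( '/var/mail/person.bak', '/home/person/INBOX.GigSplit', 580000 );
```

Because every line goes straight to the open file, nothing is left sitting in a buffer when the input ends, and memory use stays flat no matter how big a single message is.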

That's just the way I roll!* ;)

Speaking of coding, Google has their Code Jam going on, but where's the love for Perl? You can program in C++, C#, Java, Python and VB.NET, but not Perl. It probably has to do with what TopCoder supports, but something should really be done to get Perl in that list, for longevity's sake.



* = My new favorite saying

  • From the man page for formail:

                  formail is a filter that can be used to force mail into mailbox format,
                  perform ‘From ’ escaping, generate auto-replying headers, do simple
                  header munging/extracting or split up a mailbox/digest/articles file.
                  The mail/mailbox/article contents will be expe
    • Thanks ... it looks like it could do the trick, but upon closer examination, formail will split it into separate messages, but not chunked mbox files of a specified size. 6700 individual message files vs. 25 mbox files.

      - Jason
  • Hi. I discovered the script only stores a mail when it encounters the beginning of the next one (^From ). So, the last e-mail is not stored since there is no "NEXT" mail. I fixed it by moving:

            if ( length( $message ) > 0 ) {
                    $file = $file_base . sprintf( "%02d", $file_i );
                    #print "Got message # $message_count - a