Friday December 13, 2002
01:42 AM

Microsoft Word without Microsoft Word

[ #9431 ]

Sometimes I need to read Microsoft Word documents, but I do not have Word, or any other word processor. I do not process words really. Indeed, several Word attachments are still in my inbox waiting for me to care enough to extract and download them.

I am a plain text sort of person. I read Word documents with my favorite text editor. I can ignore most of the gibberish and the rest of the content comes out just fine.

I cannot see the images, though. Most of the time this does not matter because the images do not have any information I cannot get from the text. A couple of documents I read last week said "See the formula in Figure N.M", and I really did need to see the formula.

What to do? Buy Word? Yeah, right. Install something that can import Word documents? Too much work. Readjust reality so I do not need to see the formula? Not in this case.

Looking into my bag of tricks I notice Google and Perl. Google tells me Word stores its images as embedded PNG strings. Perl lets me write fancy regular expressions. I think I have a winner.

Why spend money when I can get away with 20 lines of Perl? Read in the data, look for a PNG string, save it to a file, and try again where I left off. Easy peasy.

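my $HEADER = "\x89PNG";                # first four bytes of the PNG signature
my $FOOTER = "IEND\xAE\x42\x60\x82";   # IEND chunk type plus its fixed CRC

foreach my $file ( @ARGV )
    {
    print "Extracting $file\n";
    (my $image_base = $file) =~ s/(.*)\..*/$1/;   # strip the extension

    # Slurp the whole document as raw bytes
    open my $in, '<:raw', $file or do { warn "$file: $!\n"; next };
    my $data = do { local $/; <$in> };
    close $in;

    my $count = 0;

    # /s lets . match newlines; /g resumes where the last match ended
    while( $data =~ m/(\Q$HEADER\E.*?\Q$FOOTER\E)/sg )
        {
        my $image      = $1;
        $count++;
        my $image_name = "$image_base.$count.png";
        # "or do { ...; next }" so the warning prints before we skip ahead
        open my $out, '>:raw', $image_name
            or do { warn "$image_name: $!\n"; next };
        print "Writing $image_name: ", length($image), " bytes\n";
        print {$out} $image;
        close $out;
        }

    }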
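
To try the matching loop without a real Word document, the same idea can be run against a made-up buffer in memory. This is only a sketch: extract_pngs is a hypothetical helper name, and the two "images" are fake byte strings that merely start and end like a PNG, so nothing here says anything about how Word actually lays out its files.

```perl
use strict;
use warnings;

# Start of the PNG signature, and the last 8 bytes of every PNG file:
# the chunk type "IEND" followed by its CRC (AE 42 60 82).
my $HEADER = "\x89PNG";
my $FOOTER = "IEND\xAE\x42\x60\x82";

# Hypothetical helper: return every HEADER...FOOTER run found in $data.
sub extract_pngs
    {
    my( $data ) = @_;
    my @images;

    # /s lets . match newlines in binary data; /g resumes after the last match
    while( $data =~ m/(\Q$HEADER\E.*?\Q$FOOTER\E)/sg )
        {
        push @images, $1;
        }

    return @images;
    }

# Synthetic buffer (an assumption for illustration): two fake images in junk
my $fake1 = $HEADER . "\r\n\x1a\n" . "first"  . $FOOTER;
my $fake2 = $HEADER . "\r\n\x1a\n" . "second" . $FOOTER;
my $blob  = "junk\0\0" . $fake1 . "more junk\n" . $fake2 . "trailer";

my @found = extract_pngs( $blob );
print "Found ", scalar @found, " images\n";
print "Image $_: ", length( $found[$_ - 1] ), " bytes\n" for 1 .. @found;
```

The non-greedy .*? matters here: a greedy match would run from the first header to the last footer and hand back one oversized blob instead of two images.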

The next time I care, I will probably figure out how to extract the captions so I can use those for the file names. I would rather have a name like figure-n.m.png than file.i.png. At the moment I do not care. If I got really fancy I could write a word2html converter.
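
The naming half of that idea is easy to sketch, assuming the caption text has already been pulled out of the document somehow (that is the hard part, and this does not attempt it). The helper name caption_to_filename and the "Figure N.M" caption shape are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: turn caption text such as "Figure 3.2" into a
# file name. Returns nothing when there is no recognizable figure number.
sub caption_to_filename
    {
    my( $caption ) = @_;
    return unless $caption =~ m/\bFigure\s+(\d+)\.(\d+)/i;
    return "figure-$1.$2.png";
    }

print caption_to_filename( "Figure 3.2: The formula" ), "\n";
```

A caller could fall back to the file.i.png style name whenever the helper returns nothing.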

What's next? Reading Excel files with more, but that's easy with Spreadsheet::ParseExcel.