Sometimes I need to read Microsoft Word documents, but I do not have Word, or any other word processor. I do not process words really. Indeed, several Word attachments are still in my inbox waiting for me to care enough to extract and download them.
I am a plain text sort of person. I read Word documents with my favorite text editor. I can ignore most of the gibberish and the rest of the content comes out just fine.
I cannot see the images, though. Most of the time this does not matter because the images do not have any information I cannot get from the text. A couple of documents I read last week said "See the formula in Figure N.M", and I really did need to see the formula.
What to do? Buy Word? Yeah, right. Install something that can import Word documents? Too much work. Readjust reality so I do not need to see the formula? Not in this case.
Looking into my bag of tricks I notice Google and Perl. Google tells me Word stores its images as embedded PNG strings. Perl lets me write fancy regular expressions. I think I have a winner.
Why spend money when I can get away with 20 lines of Perl? Read in the data, look for a PNG string, save it to a file, and try again where I left off. Easy peasy.
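use strict;
use warnings;

# A PNG starts with \x89 "PNG" and ends with the IEND chunk and its CRC
my $HEADER = "\211PNG";
my $FOOTER = "IEND\xAEB`\x82";

foreach my $file ( @ARGV )
{
    print "Extracting $file\n";
    (my $image_base = $file) =~ s/(.*)\..*/$1/;

    # Slurp the whole file as raw bytes
    my $in;
    unless( open $in, '<:raw', $file ) { warn "$file: $!\n"; next }
    my $data = do { local $/; <$in> };
    close $in;

    my $count = 0;
    # /s lets . cross newlines; /g picks up where the last match ended
    while( $data =~ m/(\Q$HEADER\E.*?\Q$FOOTER\E)/sg )
    {
        my $image = $1;
        $count++;
        my $image_name = "$image_base.$count.png";

        my $out;
        unless( open $out, '>:raw', $image_name ) { warn "$image_name: $!\n"; next }
        print "Writing $image_name: ", length($image), " bytes\n";
        print $out $image;
        close $out;
    }
}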
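The scan itself is easy to try without a Word document on hand. Here is a minimal sketch, assuming a made-up helper named extract_pngs and a synthetic buffer whose "images" are not real PNGs, just the right first and last bytes with filler in between. It also assumes the full eight-byte PNG signature, which is stricter than the four bytes the script above looks for. That is my tightening, not something the script needs.

```perl
use strict;
use warnings;

# Full 8-byte PNG signature, and the fixed tail of every IEND chunk
# (the chunk type "IEND" followed by its CRC, AE 42 60 82).
my $SIGNATURE = "\x89PNG\r\n\x1a\n";
my $TRAILER   = "IEND\xAE\x42\x60\x82";

# Hypothetical helper: return every PNG-looking byte string found in $data.
sub extract_pngs {
    my ($data) = @_;
    my @images;

    # /s lets . match newlines; /g resumes each search after the last match.
    while ( $data =~ m/(\Q$SIGNATURE\E.*?\Q$TRAILER\E)/sg ) {
        push @images, $1;
    }

    return @images;
}

# Synthetic buffer: two fake "images" surrounded by filler bytes.
my $fake_a = $SIGNATURE . "first\n" . $TRAILER;
my $fake_b = $SIGNATURE . "second"  . $TRAILER;
my $buffer = "junk\0\0" . $fake_a . "more junk" . $fake_b . "tail";

my @found = extract_pngs($buffer);
printf "%d images, %d and %d bytes\n", scalar @found, map { length } @found;
```

The non-greedy .*? is what keeps two images from being swallowed as one: each match stops at the first trailer it meets, and the /g flag starts the next search right after it.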
The next time I care, I will probably figure out how to extract the captions so I can use those for the file names. I would rather have a name like figure-n.m.png than file.i.png. At the moment I do not care. If I got really fancy I could write a word2html converter.
What's next? Reading Excel files with more, but that's easy with Spreadsheet::ParseExcel.