I have received two revisions (and may ultimately receive more) of a Word document specification (and I use that term loosely). The main part of this document I'm concerned with is a series of tables. I am extracting these tables into an Excel spreadsheet by cutting and pasting. I then process the spreadsheet with a custom Perl program that spits out YAML, SQL DDL, and a couple of other important goodies.
Obviously I'd like to eliminate the cut-and-paste part of this process. Besides being something I just don't want to do, it is error-prone, slow, and difficult to replicate consistently.
Does anyone know of a way I can automate this extraction process? I'm willing to consider any language if necessary, though of course I prefer Perl. I'm also willing to consider intermediate formats, such as converting to OpenOffice, AbiWord, or whatever. (My Excel spreadsheet is already an intermediate format.) I'd like any such conversions to be automatable too, but even if I had to convert manually and then extract, it would still shrink the human-driven, error-prone, unrepeatable part of this process by at least an order of magnitude.
Incidentally, I have reason to believe that the
Antiword may be of some help (Score:1)
I've used antiword [demon.nl] in the past for reading MS Word docs, but I don't know how well it reads tables. You might want to give it a try.
Re:Antiword may be of some help (Score:2)
Thank you! It looks like antiword converts to XML and/or DocBook, so maybe I can go that route. It says the support is still experimental, but I'll check it out. Even if it doesn't work today, it may work at some point in the future.
J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
Re:Antiword may be of some help (Score:2)
Awesome!!! This is entirely feasible! Thank you!
The tables come out as elements called <informaltable>. I can parse that XML, extract those elements, and convert them. In fact, it looks like this is better than going to Excel, because the Excel route produces several "phantom" blank cells that my current program has to ignore.
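A minimal sketch of that parsing step, under some assumptions: the DocBook comes from something like `antiword -x db spec.doc`, each table follows the usual DocBook layout of <row> elements containing <entry> cells, the elements carry no XML namespace, and the non-core XML::LibXML module is installed. The `extract_tables` name is hypothetical, and the exact markup antiword emits may differ from this.

```perl
use strict;
use warnings;
use XML::LibXML;

# Return one array ref per <informaltable>; each holds rows,
# and each row holds the trimmed text of its <entry> cells.
sub extract_tables {
    my ($xml) = @_;

    # Do not fetch the external DocBook DTD while parsing.
    my $dom = XML::LibXML->load_xml(string => $xml, load_ext_dtd => 0);

    my @tables;
    for my $table ($dom->findnodes('//informaltable')) {
        my @rows;
        for my $row ($table->findnodes('.//row')) {
            my @cells = map {
                my $text = $_->textContent;   # also flattens nested <para>
                $text =~ s/^\s+|\s+$//g;      # trim both ends
                $text =~ s/\s+/ /g;           # collapse inner whitespace
                $text;
            } $row->findnodes('./entry');
            push @rows, \@cells;
        }
        push @tables, \@rows;
    }
    return \@tables;
}
```

Each table comes back as plain nested arrays, so `$tables->[0][2][1]` would be the second cell of the third row of the first table. Those arrays could then feed the same code that currently reads the spreadsheet.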
I'm not sure if I'm going to have to do this specific file again, but there's a good chance I might, and if I do I will attempt to program this process. If I don't for this f
Re:Antiword may be of some help (Score:1)
Glad it's working out for you. I haven't used antiword in over a year but it was very helpful when I needed it.
Win32::OLE (Score:2)
Re:Win32::OLE (Score:2)
Thanks for the pointer. Maybe I can do this entirely in pure Perl, and drop any intermediate file formats. :)
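A rough sketch of what the all-Perl route might look like, assuming Windows with Word installed and using the Tables, Rows, Columns, and Cell members of the Word object model through Win32::OLE. The names `read_word_tables` and `clean_cell_text` are hypothetical helpers, not part of any module, and tables with merged cells may need extra handling that this does not attempt.

```perl
use strict;
use warnings;

# Word ends each cell's text with a CR plus a BEL cell marker;
# strip that marker and trim the rest.
sub clean_cell_text {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/[\r\x07]+\z//;    # end-of-cell marker
    $text =~ s/\r/\n/g;          # paragraph breaks inside a cell
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return $text;
}

# Return every table in the document as rows of cleaned cell strings.
sub read_word_tables {
    my ($path) = @_;             # should be an absolute path
    require Win32::OLE;          # loaded lazily: Windows only
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word\n";
    my $doc = $word->Documents->Open($path)
        or die "Cannot open $path\n";

    my @tables;
    my $table_set = $doc->Tables;
    for my $t (1 .. $table_set->Count) {
        my $table = $table_set->Item($t);
        my @rows;
        for my $r (1 .. $table->Rows->Count) {
            my @cells;
            for my $c (1 .. $table->Columns->Count) {
                # Assumption: a merged or missing cell yields no object.
                my $cell = $table->Cell($r, $c) or next;
                push @cells, clean_cell_text($cell->Range->Text);
            }
            push @rows, \@cells;
        }
        push @tables, \@rows;
    }
    $doc->Close(0);              # 0 = wdDoNotSaveChanges
    return \@tables;
}
```

Keeping the text cleanup in its own function means that part can be checked on any platform, while the OLE calls only run where Word is available.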
Win32::OLE + XML/HTML (Score:1)
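use strict;
use warnings;
use Win32::OLE;
use File::Spec;
# WdSaveFormat values from the Word object model
sub wdFormatHTML {8}
sub wdFormatXML {11}
# Source and target come from the command line; pass Word absolute paths
my ($src_name, $target_name) = map { File::Spec->rel2abs($_) } @ARGV;
my $msword = Win32::OLE->new("Word.Application", "Quit") or die "Cannot start Word";
my $doc = $msword->Documents->Open($src_name) or die "Cannot open $src_name";
$doc->SaveAs($target_name, wdFormatXML);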
Re:Win32::OLE + XML/HTML (Score:2)
Thank you for the concrete example. That looks like it may work very well.