jdavidb (1361)

http://voiceofjohn.blogspot.com/

J. David Blackstone has a Bachelor of Science in Computer Science and Engineering and nine years of experience at a wireless telecommunications company, where he learned Perl and never looked back. J. David has an advantage in that he works really hard, he has a passion for writing good software, and he knows many of the world's best Perl programmers.

Journal of jdavidb (1361)

Tuesday February 28, 2006
11:53 AM

Extracting from .DOC

[ #28824 ]

I have received two revisions (and may ultimately receive more) of a Word document specification (and I use that term loosely). The main part of this document I'm concerned with is a series of tables. I am extracting these tables into an Excel spreadsheet by cutting and pasting. I then process the spreadsheet with a custom Perl program that spits out YAML, SQL DDL, and a couple of other important goodies.

Obviously I'd like to eliminate the cut and paste part of this process. Besides being something I just don't want to do, it is error prone, slow, and difficult to consistently replicate.

Does anyone know of a way I can automate this extraction process? I'm willing to consider any language, if necessary, though of course I prefer Perl. I'm also willing to consider intermediate formats, such as converting to OpenOffice, AbiWord, or whatever. (My Excel spreadsheet is already an intermediate format.) I'd like any such conversions to be automatable too, but even if I had to convert manually and then extract, it would still shrink the human-driven, error-prone, unreplicable part of this process by at least an order of magnitude.

Incidentally, I have reason to believe that the .DOC I'm receiving was converted by someone else from .PDF. She hasn't shared details with me on what software she used to accomplish that, but I'd like to learn that feat too, if anyone knows. I'd also be interested in learning to program this extraction directly from .PDF, if it's even possible.

  • I've used antiword [demon.nl] in the past for reading MS Word docs, but I don't know how well it reads tables. You might want to give it a try.

    • Thank you! It looks like antiword converts to XML and/or DocBook, so maybe I can go that route. It says the support is still experimental, but I'll check it out. Even if it doesn't work today, it may work at some point in the future.

      --
      J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
    • Awesome!!! This is entirely feasible! Thank you!

      The tables come out as elements called <informaltable>. I can parse that XML, extract those, and convert them. In fact it looks like this is better than going to Excel, because going to Excel produces several "phantom" blank cells which I have to ignore in my current program.

      I'm not sure if I'm going to have to do this specific file again, but there's a good chance I might, and if I do I will attempt to program this process. If I don't for this f

      --
      J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
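A minimal sketch of the route described above: read the DocBook XML that antiword emits and pull each <informaltable> out as rows of cell text. It assumes the XML::LibXML module from CPAN is installed and that the tables follow the usual DocBook layout of <row> and <entry> elements inside <informaltable>; the thread only confirms the <informaltable> element itself. The function name extract_tables and the sample data are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Return every <informaltable> as an array of rows,
# each row being an array of trimmed cell strings.
sub extract_tables {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my @tables;
    for my $table ($doc->findnodes('//informaltable')) {
        my @rows;
        # Assumption: rows and cells use the standard DocBook <row>/<entry> names.
        for my $row ($table->findnodes('.//row')) {
            my @cells = map {
                my $text = $_->textContent;
                $text =~ s/^\s+|\s+$//g;   # trim leading/trailing whitespace
                $text =~ s/\s+/ /g;        # collapse internal runs of whitespace
                $text;
            } $row->findnodes('./entry');
            push @rows, \@cells;
        }
        push @tables, \@rows;
    }
    return \@tables;
}

# Hypothetical sample standing in for real antiword output.
my $sample = <<'XML';
<article>
  <para>Intro text</para>
  <informaltable><tgroup cols="2"><tbody>
    <row><entry><para>Field</para></entry><entry><para>Type</para></entry></row>
    <row><entry><para>id</para></entry><entry><para>INTEGER</para></entry></row>
  </tbody></tgroup></informaltable>
</article>
XML

my $tables = extract_tables($sample);
for my $row (@{ $tables->[0] }) {
    print join("\t", @$row), "\n";
}
```

With a real document, the XML would come from something like `antiword -x db spec.doc > spec.xml` (spec.doc is a placeholder name), and the file contents would be passed to extract_tables in place of the sample. If the output turns out to carry an XML namespace, the XPath expressions would need adjusting.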
  • If you're on Win32, Win32::OLE [cpan.org] could help you.
    • Thanks for the pointer. Maybe I can do this entirely in pure Perl, and drop any intermediate file formats. :)

      --
      J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
    • Maybe you could save your doc as XML or HTML, and then parse the result with your favorite XSLT or regex tool. Something along these lines:

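      use strict;
      use warnings;
      use Win32::OLE;

      # WdSaveFormat values from the Word object model
      use constant wdFormatHTML => 8;
      use constant wdFormatXML  => 11;

      my ($src_name, $target_name) = @ARGV;   # full paths are safest with Word

      my $msword = Win32::OLE->new("Word.Application", "Quit")
          or die "Cannot start Word: " . Win32::OLE->LastError;
      my $doc = $msword->Documents->Open($src_name)
          or die "Cannot open $src_name: " . Win32::OLE->LastError;
      $doc->SaveAs($target_name, wdFormatXML);
      $doc->Close;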
      • Thank you for the concrete example. That looks like it may work very well.

        --
        J. David works really hard, has a passion for writing good software, and knows many of the world's best Perl programmers
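The idea floated earlier in this thread, skipping intermediate formats altogether, could look roughly like the sketch below, which walks the Tables collection of the Word object model through Win32::OLE. It is untested against the actual specification document and assumes Windows with Word installed. The names clean_cell and read_tables are invented for this example, and tables with merged cells may well need extra handling.

```perl
use strict;
use warnings;

# Word ends each cell's text with a carriage return plus an
# end-of-cell mark (\x07); strip those and normalize inner line breaks.
sub clean_cell {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/[\r\x07]+\z//;
    $text =~ s/\r/\n/g;
    return $text;
}

# Return all tables in the document as arrays of rows of cell strings.
sub read_tables {
    my ($path) = @_;
    require Win32::OLE;   # loaded lazily, since it only exists on Windows
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError();
    my $doc = $word->Documents->Open($path)
        or die "Cannot open $path: " . Win32::OLE->LastError();
    my @tables;
    for my $t (1 .. $doc->Tables->Count) {
        my $table = $doc->Tables->Item($t);
        my @rows;
        for my $r (1 .. $table->Rows->Count) {
            my @cells;
            for my $c (1 .. $table->Columns->Count) {
                # Assumption: a missing (for example merged) cell comes back
                # as undef, so it is recorded as an empty string.
                my $cell = $table->Cell($r, $c);
                push @cells, $cell ? clean_cell($cell->Range->Text) : '';
            }
            push @rows, \@cells;
        }
        push @tables, \@rows;
    }
    $doc->Close(0);   # 0 = wdDoNotSaveChanges
    return \@tables;
}

# Usage: perl read_tables.pl C:\full\path\to\spec.doc
if (@ARGV) {
    my $tables = read_tables($ARGV[0]);
    for my $rows (@$tables) {
        print join("\t", @$_), "\n" for @$rows;
        print "\n";
    }
}
```

Compared with saving as XML or HTML first, this trades a parsing step for a dependency on a running copy of Word, so it only helps if the extraction always happens on a Windows machine.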