All the Perl that's Practical to Extract and Report

use Perl

grantm (164)

grantm
  (email not shown publicly)
http://www.mclean.net.nz/

Just a simple [cpan.org] guy, hacking Perl for fun and profit since way back in the last millennium. You may find me hanging around in the monastery [perlmonks.org].

What am I working on right now? Probably the Sprog project [sourceforge.net].

GnuPG key Fingerprint:
6CA8 2022 5006 70E9 2D66
AE3F 1AF1 A20A 4CC0 0851

Journal of grantm (164)

Tuesday September 25, 2007
04:06 AM

License to Process Words

[ #34536 ]

I've just finished converting one MS Word document of about 150 pages to a group of about a dozen HTML pages. What a nightmare!

Last time I had to do this, the output was closer to 100 HTML files with very little formatting so I ended up scripting much of it. I loaded the document into Open Office and saved as ODT. Then I used XPathScript to spit out a series of very plain HTML files. With the site stylesheet applied they looked very smart.

This time, the document structure didn't really lend itself to scripting and there was more formatting that I wanted to preserve (e.g. headings, bullet lists, simple tables). So I used 'Save as Web Page' from Word and then did most of it manually with Vim.

The HTML that Word produced was unspeakably vile: all sorts of illegal constructs (e.g. a <p> inside a <span> inside another <p>!); enormous sections of proprietary markup inside comment markers; kilobytes of unnecessary attributes (align="left" on every <p>); and invented markup tags (e.g. <place> and <placetype>).
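To make that cruft concrete, here is a minimal sketch of stripping three of the problems just listed: comment blocks, align="left" attributes, and the invented <place>/<placetype> tags. It is written in Python with plain regexes purely for illustration, not the LibXML/XPath approach used for the real job. The sample string and the function name strip_word_cruft are made up for this example, and regexes like these only cope with simple, predictable markup.

```python
import re

# Hypothetical sample resembling Word's 'Save as Web Page' output.
WORD_HTML = (
    '<!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]-->'
    '<p align="left" class="MsoNormal">Visit '
    '<place><placetype>Lake</placetype> Taupo</place>.</p>'
)


def strip_word_cruft(html: str) -> str:
    """Remove comment blocks, align="left" attributes and invented tags."""
    # Drop every comment, including conditional comments that wrap
    # proprietary markup.
    html = re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)
    # Remove the redundant align="left" attribute wherever it appears.
    html = re.sub(r'\s+align="left"', '', html, flags=re.IGNORECASE)
    # Unwrap the invented tags: delete the tags but keep the text inside.
    html = re.sub(r'</?(?:place|placetype)\b[^>]*>', '', html,
                  flags=re.IGNORECASE)
    return html


print(strip_word_cruft(WORD_HTML))
# prints: <p class="MsoNormal">Visit Lake Taupo.</p>
```

Nesting errors such as a <p> inside a <span> inside another <p> are beyond what a regex can safely repair; those need a real parser.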

I was able to strip out much of the cruft with LibXML/XPath/DOM manipulations and some search and replace regexes in Vim. But the result was still gruesomely awful. Much of it came down to operator error on the part of whoever typed the document:

  • every alternate paragraph empty to provide vertical whitespace
  • strings of empty paragraphs to push content onto a new page
  • 'tables' of data with columns aligned using spaces
  • 'bulleted lists' created by inserting a bullet character at the start of each line
  • vast sections of body text in 'Heading 2' style with the font overridden to give normal looking text
  • most actual headings rendered using bold and/or font changes

As one of my colleagues commented, people should not be allowed to use a Word Processor without a license.

It can't all be blamed on users though. The user interfaces of Word and Open Office are absolutely awful. They make it far too easy to do the wrong thing, by jamming the screen full of toolbar buttons and menus. Conversely they make it hard to do the right thing by hiding the style selection tool in amongst all that visual clutter. Each new release over the years seems to have made the problem worse. What these interfaces need is less, not more.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Dreamweaver has this great little menu entry called "Clean up Word HTML".

    It's not perfect, but generally when I need to do Word cleanup, I run it through that filter first, then do the rest by hand from there.
  • There is also HTML-Tidy, which is available as a plug-in to many GUI HTML editors, e.g. HTML-Kit or Quanta+. It's a small C library with a small executable for command line use, and there is of course a Perl module based on it, HTML::Tidy.

    As much as I loathe and detest Microsoft products, what scares me even more is that people who love Microsoft have no idea how to use their products at all. It's hardly surprising that most Windows machines are infected with something or that when I re

    --
    -- "It's not magic, it's work..."