I've just finished converting one MS Word document of about 150 pages to a group of about a dozen HTML pages. What a nightmare!
Last time I had to do this, the output was closer to 100 HTML files with very little formatting, so I ended up scripting much of it. I loaded the document into OpenOffice and saved it as ODT. Then I used XPathScript to spit out a series of very plain HTML files. With the site stylesheet applied, they looked very smart.
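For anyone curious what the scripted route looks like, here is a minimal sketch in Python rather than XPathScript, so it is a stand-in for the approach and not the script I actually used. It assumes the ODT is an ordinary zip archive with a `content.xml` inside, and that only top-level headings and paragraphs matter. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# Standard OpenDocument namespaces
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def content_xml_to_html(content_xml):
    """Turn top-level headings and paragraphs of an ODT content.xml into plain HTML.

    Lists, tables and inline formatting are ignored in this sketch.
    """
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    if body is None:
        return ""
    lines = []
    for elem in body:
        # Flatten any inline spans down to their text
        text = "".join(elem.itertext()).strip()
        if not text:
            continue
        if elem.tag == f"{{{TEXT_NS}}}h":
            # HTML only has h1..h6, so clamp deeper outline levels
            level = min(int(elem.get(f"{{{TEXT_NS}}}outline-level", "1")), 6)
            lines.append(f"<h{level}>{escape(text)}</h{level}>")
        elif elem.tag == f"{{{TEXT_NS}}}p":
            lines.append(f"<p>{escape(text)}</p>")
    return "\n".join(lines)


def odt_to_html(path):
    """Read content.xml out of the ODT archive at `path` and convert it."""
    with zipfile.ZipFile(path) as odt:
        return content_xml_to_html(odt.read("content.xml"))
```

Splitting the output into one file per top-level heading would be a small loop on top of this.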
This time, the document structure didn't really lend itself to scripting and there was more formatting that I wanted to preserve (e.g. headings, bullet lists, simple tables). So I used 'Save as Web Page' from Word and then did most of it manually with Vim.
The HTML that Word produced was unspeakably vile. All sorts of illegal constructs (e.g. a <p> inside a <span> inside another <p>!); enormous sections of proprietary markup inside comment markers; kilobytes of unnecessary attributes (align="left" on every <p>); and invented markup tags (e.g. <place> and <placetype>).
I was able to strip out much of the cruft with LibXML/XPath/DOM manipulations and some search and replace regexes in Vim. But the result was still gruesomely awful. Much of it came down to operator error on the part of whoever typed the document.
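To give a flavour of the mechanical part of the cleanup, here is a rough Python sketch of the regex side of it. It is not the LibXML/DOM code I used, just an illustration of the three kinds of cruft mentioned above. The function name is invented, and allowing a namespace prefix on the invented tags is an assumption on my part. Regexes are a blunt tool for HTML, so treat this as a first pass only.

```python
import re


def strip_word_cruft(html):
    """Crude first-pass cleanup of Word's 'Save as Web Page' output."""
    # 1. Remove comment blocks, including the ones stuffed with proprietary markup
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

    # 2. Unwrap invented tags such as <place> and <placetype>, keeping their text.
    #    The optional "prefix:" part is an assumption, in case the tags are namespaced.
    html = re.sub(
        r"</?(?:\w+:)?(?:place|placetype)\b[^>]*>", "", html, flags=re.IGNORECASE
    )

    # 3. Drop the redundant align="left" attribute (quoted or unquoted).
    #    Note this would also hit the same string in body text, which is rare.
    html = re.sub(
        r"""\s+align=(?:"left"|'left'|left)(?=[\s>/])""", "", html, flags=re.IGNORECASE
    )
    return html
```

Nesting errors like a <p> inside a <span> inside another <p> are beyond what a regex can sensibly fix; those need a real parser or a pair of eyes.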
As one of my colleagues commented, people should not be allowed to use a word processor without a licence.
It can't all be blamed on users though. The user interfaces of Word and OpenOffice are absolutely awful. They make it far too easy to do the wrong thing by jamming the screen full of toolbar buttons and menus. Conversely, they make it hard to do the right thing by hiding the style selection tool amongst all that visual clutter. Each new release over the years seems to have made the problem worse. What these interfaces need is less, not more.