Title: Automatically Generating Word Documents

Date: 03 April 2005 13:16:11 +01:00

I like to write my documents without WYSIWYG (what you see is what you get). One reason for this is that I want my mind to concentrate on what the document actually says, rather than on trivial details about what it looks like.

There are good typesetting systems, such as LaTeX, that can make the text look nice by putting the line breaks in the best places, putting the right amount of space between the words, and so on, so why should I let such things distract me when I'm trying to write creatively?

Another reason for not using WYSIWYG is that a document becomes more like a program. If you are editing markup, you are editing the "source code" which can then be "run" in various ways. You might transform your markup (using various tools) into different formats, such as an HTML version for online viewing and a TeX version for printing.

If you later find something wrong then you can change the source code and all versions will be changed automatically when you run your script to re-generate them. You can also use the features of your favourite editor, not to mention obtaining the assistance of your filing system by writing the different sections in files and directories and getting the script to merge them when generating the output (and the order in which they are included can be changed quickly).

Your script might also include some code to write certain parts of the document automatically; for example, test results can be generated on-the-fly by running the test program (and any changes to that program will automatically be reflected in the document). In short, you can relax without having to worry about keeping everything consistent and up-to-date, because the computer does all that for you (well, most of it anyway).

But there's one problem with this approach. What if you're writing a technical paper for publication, and the editor says, "you must submit it in Microsoft Word format"? There are many ways of converting other formats into Word, but usually the editor will go further and say "and it must look like this example" or "it must use that template". This means that, after you've got the document into Word format (by converting from HTML or whatever), you usually have to do some further work inside Word to make it look like what the editor wants. If you later need to make more changes, you have two choices. You can make all future changes inside Word, which means sacrificing all the benefits of the source-code approach, not to mention being more restricted in your choice of operating system and working environment, and causing problems if you want to publish in other forms and Word's conversion is not good enough. Or you can repeat the whole process of getting it into Word and making the adjustments, as many times as is needed. If any of this annoys you, then that may negatively affect the quality of your work, so it's worth doing something about it.

One option is to become an expert in Microsoft Word's macro system and use that, either to automate the process of getting your document into Word or, if you're a real macro expert, to drive the whole approach instead of using the scripting language of your choice. However, this does have its disadvantages. You have to be very good with the macro system in order to make your macros robust (it's too easy to make some small change that accidentally stops the macro working). And if your only reason for using Word is to fulfil the requirements of some editor, why bother to learn Microsoft's product-specific language if there's some way of doing the same job using Python or whatever other general scripting language you're already skilled at?

The approach that I eventually found was this. Modern versions of Microsoft Word have an interesting way of saving documents as HTML. Enough information is encoded in the HTML for Word to re-read the document exactly in nearly all cases (if there is something that cannot be saved in the HTML then Word should tell you what that is at the time of the save). Now, there are various views about the quality of Word's HTML in terms of Web standards and portability, but if you regard Word's version of HTML as an alternative file format for Word documents, there is hope for your automatic program. Simply save the editor's example as HTML, have your program generate HTML like that example, and import this into Word whenever you need to produce a Word version; no post-editing in Word should be necessary. You may even be able to use diff and other utilities to find how the Word document has been changed by others (if they haven't tracked the changes) by converting it back to HTML and comparing that with what you sent.
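For the comparison step, here is a minimal sketch using Python's standard difflib module instead of the diff utility. It assumes that both versions have already been saved as HTML and read into strings; the function name, the file labels and the sample strings are made up for illustration.

```python
import difflib

def html_changes(sent_html, returned_html):
    """Return unified-diff lines showing how returned_html differs from sent_html."""
    return list(difflib.unified_diff(
        sent_html.splitlines(),
        returned_html.splitlines(),
        fromfile="sent.html",        # labels only; no files are read here
        tofile="returned.html",
        lineterm="",
    ))

# Hypothetical example: somebody reworded the second paragraph.
sent = "<p>First paragraph.</p>\n<p>Second paragraph.</p>"
returned = "<p>First paragraph.</p>\n<p>Second paragraph, reworded.</p>"
changes = html_changes(sent, returned)
for line in changes:
    print(line)
```

The comparison is line-based, so it will probably be more readable if both files are first brought to one paragraph per line.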

I wasn't able to test enough versions of Word to work out exactly when this feature was introduced; the HTML functionality of Word 97 does NOT preserve all formatting, but Word 2002 does. If you do have access to a real copy of Microsoft Word, as opposed to something like OpenOffice.org, then it's best to use the real Microsoft product in order to maximise your chances that the conversion will go without a hitch, especially when converting back to Word format. If you don't have access to Word then you might be able to get away with asking the person who wants the Word document to save their example as HTML and import your HTML reply, although this requires more skill on their part especially if you have separate image files.

Reducing the Clutter

Word's HTML may be verbose, but it's not difficult to understand which part does what (especially with syntax highlighting), and it's much easier to work with Word's HTML than it is to work with its RTF (how many text editors with special facilities for navigating around RTF source do you know?). Modern versions of Word store the document's styles in the HTML header, using the CLASS attribute to assign HTML elements to styles and using stylesheet overrides to show other formatting.

The first thing to realise is that the only markup that matters for our purposes is the body section of the document. Everything up to and including the line beginning <body can be included without change at the start of the output document. You can do this in Python as follows:

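import re

template = open(template_html_filename).read()
text = open(my_html_filename).read()
out = open(output_filename, "w")
# re.search rather than re.match, because the <body> tag
# is not at the very start of the file
body_tag = r"<body[^>]*>"
out.write(template[:re.search(body_tag, template).end()])
out.write(text[re.search(body_tag, text).end():])
out.close()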

which means you can edit (or automatically generate) your HTML without worrying about what comes before the <body> tag. You could adapt this so as to paste in text up to any point in the document you like by inserting special keywords to indicate that point in both the template and your document; this might be useful if the editor requires a complex header (name, address, etc, in an unusual format) that is hard to convert but that will rarely (if ever) have to change during the editing process.
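The keyword variant might be sketched as follows. The marker string and the function name are hypothetical (any string that cannot occur in the real text will do), and the sketch assumes you have put the marker into both the saved template and your own HTML yourself.

```python
MARKER = "<!-- CONTENT STARTS HERE -->"  # hypothetical keyword, not something Word writes

def splice_at_marker(template, text, marker=MARKER):
    """Keep the template up to and including the marker, then the text after its marker."""
    template_end = template.index(marker) + len(marker)
    text_start = text.index(marker) + len(marker)
    return template[:template_end] + text[text_start:]

# Made-up sample: the complex header comes from the template, the rest from our text.
template = ("<html><body><p class=Header>Name, address</p>"
            + MARKER + "<p>old text</p></body></html>")
text = "<html><body>" + MARKER + "<p>new text</p></body></html>"
result = splice_at_marker(template, text)
print(result)
```

str.index raises ValueError if the marker is missing from either string, which seems preferable to silently producing a document with the wrong header.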

The remaining "clutter" in the template may include backward-compatibility markup (you can safely replace the regular expression <!\[if !support.*!\[endif\]> with nothing), and markup for smart tags and spelling and grammar alerts (you can also safely remove this, but it's easier if you can tell Word not to save it in the first place by unchecking "embed smart tags" and "embed linguistic data" in the "save options" dialogue and turning off spelling and grammar check in the "language" dialogue).
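As a sketch of that first replacement in Python: the pattern below is a non-greedy variant of the expression above (my own adjustment, so that two blocks on the same line are removed separately and not together with everything between them), and the sample markup is made up to resemble what Word writes.

```python
import re

# Non-greedy variant of the expression in the text; re.DOTALL lets a block span line breaks.
COMPAT_BLOCK = re.compile(r"<!\[if !support.*?!\[endif\]>", re.DOTALL)

def strip_compat_markup(html):
    """Remove the backward-compatibility blocks from the HTML."""
    return COMPAT_BLOCK.sub("", html)

# Made-up sample with two blocks on one line.
sample = ("<p class=Text>One<![if !supportEmptyParas]>&nbsp;<![endif]></p>"
          "<p class=Text>Two<![if !supportEmptyParas]>&nbsp;<![endif]></p>")
cleaned = strip_compat_markup(sample)
print(cleaned)
```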

The template you have to work with is still a little too cluttered though. There is usually a gratuitous amount of markup to represent the skips between paragraphs, and in some documents you will find rather a lot of markup that overrides the language of each part of the text. If you use search-and-replace to simplify this, you can then edit (or automatically generate) the simplified version and then reverse all your searches and replaces to get back to Word's version. That can all be scripted.

It may help to work with text that has end-of-paragraph markers but no end-of-line markers. You can get the text into that state by using code such as:

# protect the blank lines between paragraphs, join the remaining
# lines with spaces, then put the paragraph breaks back
paragraph_token = "-- PARAGRAPH" + chr(0)
template = (template.replace("\n\n", paragraph_token)
                    .replace("\n", " ")
                    .replace(paragraph_token, "\n\n"))

When you are writing your search-and-replace list, one thing that helps to begin with is to see a list of the most common markup. You can do this as follows. First, get a list of all the tags that open elements:

openingElems = re.findall("<[^/][^>]*>", template)

then generate a frequency count and print it out in order:

freqs = {}
for e in openingElems:
  freqs[e] = freqs.get(e, 0) + 1
# sort the (count, tag) pairs so that the most common tags are printed last
for f in sorted((count, tag) for tag, count in freqs.items()):
  print(f)

That, together with a look at the template document, should give you enough pointers to write a quick script that converts between your simplified HTML markup (or the markup that is generated by the something-else-to-HTML-translator utility you are using in your scripts) and the markup of the template. This script can be re-used as many times as is needed during the editing process. It's best if you put the list of things to search and replace into a list of tuples, rather than in the code itself; that way the program can be reversed just by reversing each tuple in the list:

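import re

# Each tuple is (Word's markup, simplified markup); "{}" stands for the
# content of the element, and "lots-of-attributes" for whatever attributes
# the template uses.
searchList = [
  ("<p class=Text lots-of-attributes>{}</p>",
      "<p>{}</p>"),
  ("<p class=Bulletedlist lots-of-attributes>{}</p>",
      "<ul><li>{}</li></ul>"),
  # ...
]

def convert(text, source, target):
  # turn the source markup into a regular expression that captures the content
  pattern = re.escape(source).replace(re.escape("{}"), "(.*?)")
  return re.sub(pattern,
                lambda m: target.replace("{}", m.group(1)), text)

for s in searchList: text = convert(text, s[0], s[1])
# or:
for s in searchList: text = convert(text, s[1], s[0])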

Note that we are not using special XML-handling toolkits because it's often quite awkward to get them to work with Word-generated markup; they tend to find errors and throw exceptions, which is normally good but not what we want here. Note also the slight awkwardness in the above example about bulleted lists: it has to assume that each item is its own list. You can do better by writing more code, but you might not want to (it's only for one article after all); you can preprocess your HTML by stripping <UL>s and replacing all <LI>s with <UL><LI> (unless you're also using numbered lists, in which case you need to be more clever).
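The list preprocessing just described might look something like this. It is only a sketch: the function name is made up, it assumes every item has an explicit closing tag, and turning each closing </LI> into </LI></UL> is my own addition so that the result stays balanced.

```python
import re

def one_list_per_item(html):
    """Rewrite bulleted lists so that every item is wrapped in its own <ul>."""
    # strip the existing list wrappers
    html = re.sub(r"</?ul\b[^>]*>", "", html, flags=re.IGNORECASE)
    # open a new list in front of every item
    html = re.sub(r"<li\b[^>]*>", "<ul><li>", html, flags=re.IGNORECASE)
    # and close that list again after the item
    html = re.sub(r"</li>", "</li></ul>", html, flags=re.IGNORECASE)
    return html

sample = "<UL><LI>first</LI><LI>second</LI></UL>"
converted = one_list_per_item(sample)
print(converted)
```

As noted above, this is not suitable if the document also uses numbered lists, because the <LI>s of an <OL> would be turned into bulleted items as well.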

It would be interesting to see if any readers can further automate the process of creating this script. It would be nice if a program could just look at the template and guess most of the rules automatically.

Finishing Touches

The markup for Word's equations and other special objects is complex and is best left alone; if you need to include such things in your document then you can do so by typesetting them in LaTeX or whatever and including the images (hopefully not too many Word-using editors will want to adjust your equations). Incidentally, Word defaults to setting images at 96 dots per inch, and it's best to use PNG or JPEG formats.

Footnotes can be awkward too, but it's best to write without using too many footnotes anyway.

Clearly there are going to be some quality compromises in using Word as an output format. It simply isn't as good at typesetting as LaTeX is. Hopefully the journal's production editor will use something else for the final copy. However, one thing you can do is to make sure your quotes and em-dashes look nice. You can do this by writing them as Unicode entities. Don't try to be too clever and put in ligatures, unless you understand the rules of when and when not to ligature (or can extract them from TeX), and even then don't do it as you may find the final printed copy has them replaced with strange-looking characters. However, nice quotes and em-dashes are quite simple to achieve.
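As a minimal sketch of what writing them as Unicode entities can mean in practice, the following Python function converts LaTeX-style quotes, dashes and ellipses in a fragment of plain text. The function and variable names are made up, and it is meant for text fragments only: in a whole HTML page the apostrophe rule would also hit any attribute values that are delimited by single quotes.

```python
# (LaTeX-style input, HTML entity) pairs; longer inputs come first so that
# "---" is not read as "--" followed by "-".
ENTITY_RULES = [
    ("---", "&#8212;"),   # em-dash
    ("--", "&#8211;"),    # en-dash
    ("``", "&#8220;"),    # opening double quote
    ("''", "&#8221;"),    # closing double quote
    ("`", "&#8216;"),     # opening single quote
    ("'", "&#8217;"),     # closing single quote or apostrophe
    ("...", "&#8230;"),   # ellipsis
]

def to_entities(text):
    """Replace LaTeX-style punctuation in a text fragment with numeric entities."""
    for source, entity in ENTITY_RULES:
        text = text.replace(source, entity)
    return text

sample = "``Wait---it's pages 3--5...''"
converted = to_entities(sample)
print(converted)
```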

The LaTeX to HTML translator TTH loses the em-dashes during its translation (it replaces them with hyphens), so you need to replace them with a special token before running TTH and restore them afterwards. The following Unix commands will do this and also deal with the quotes and ellipses:

mv texfile.bbl .bbl; mv texfile.aux .aux
cat texfile.tex | sed -e 's/---/SSB22ANEMDASH/g' \
    -e 's/--/SSB22ANENDASH/g' | tth | \
  sed -e 's/SSB22ANEMDASH/\&#8212;/g' \
      -e 's/SSB22ANENDASH/\&#8211;/g' \
      -e 's/``/\&#8220;/g' \
      -e "s/''/\\&#8221;/g" \
      -e 's/`/\&#8216;/g' \
      -e "s/'/\\&#8217;/g" \
      -e 's/\.\.\./\&#8230;/g' \
  > texfile.html

Previewing the Result

Finally, when you are developing your document, it may be nice to be able to preview what it will look like in Word, just as it is occasionally useful (or at least satisfying) to preview the PostScript or DVI output when working with LaTeX. This is particularly important if you are aiming for a certain page count. If you are working on Windows then you can tell your script to run Word with the HTML file on the command line. If you don't have Word then you can download Microsoft's Word Viewer (search www.microsoft.com for Word Viewer), which should show your document exactly as Word would.

Sadly, the Windows emulator WINE (www.winehq.org) doesn't seem to be up to running Word Viewer 2003 yet (it can run Word Viewer 97, but that will not be able to render this HTML properly). If WINE fails to run the installer, you can unpack Word Viewer 2003 with the cabextract utility (make sure you have the latest version of cabextract, as old versions tend to corrupt the files), but when you run the viewer it is liable to complain about calls to unimplemented functions in DLLs. You may be able to work around this by borrowing DLLs from a recent version of Windows, but I don't have a suitable Windows license to try this.

It is possible that the special commercial distribution of Wine called CrossOver Office from www.codeweavers.com will do better. However, at the time of writing, their trial version download form seems to have been non-functional for some time, so I couldn't check this. Another trialware product is TextMaker from www.softmaker.com/english/tm_en.htm which is more lightweight than OpenOffice.org (useful if you're short on RAM) and often gives a better idea of how Word will display your document, but it is not perfect (it managed to scramble the references in one of my papers) so I wouldn't buy it without checking alternatives. OpenOffice.org 2.0 is due to be released soon, so that might be worth checking too.
