    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: Automatically Generating Word Documents</title>
        <link>https://members.accu.org/index.php/articles/794</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Project Management + CVu Journal Vol 17, #2 - Apr 2005</span></div>





<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;Automatically Generating Word Documents</h1>
<p>
<strong>Date:</strong> Sun, 03 April 2005 13:16:11 +01:00</p>
<p><strong>Body:</strong>&nbsp;<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e20" id="d0e20"></a></h2>
</div>
<p>I like to write my documents without WYSIWYG (what you see is
what you get). One reason for this is that I want my mind to
concentrate on what the document actually says, rather than on
trivial details about what it looks like.</p>
<p>There are good typesetting systems, such as LaTeX, that can make
the text look nice by putting the line breaks in the best places,
putting the right amount of space between the words, and so on, so
why should I let such things distract me when I'm trying to write
creatively?</p>
<p>Another reason for not using WYSIWYG is that a document becomes
more like a program. If you are editing markup, you are editing the
&quot;source code&quot; which can then be &quot;run&quot; in various ways. You might
transform your markup (using various tools) into different formats,
such as an HTML version for online viewing and a TeX version for
printing.</p>
<p>If you later find something wrong then you can change the source
code and all versions will be changed automatically when you run
your script to re-generate them. You can also use the features of
your favourite editor, not to mention obtaining the assistance of
your filing system by writing the different sections in files and
directories and getting the script to merge them when generating
the output (and the order in which they are included can be changed
quickly).</p>
<p>Your script might also include some code to write certain parts
of the document automatically, for example, test results can be
generated on-the-fly by running the test program (and any changes
to that program will automatically be reflected in the document).
In short, you can relax without having to worry about keeping
everything consistent and up-to-date, because the computer does all
that for you (well, most of it anyway).</p>
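<p>As a rough illustration of the merging step described above, the following sketch joins section files in a given order. The function name <tt class="literal">build_document</tt> and the one-file-per-section layout are assumptions made for this example, not part of any particular tool:</p>

```python
import os

def build_document(section_dir, order):
    """Join the section files named in `order` into a single string.

    Hypothetical layout: one file per section inside section_dir.
    """
    parts = []
    for name in order:
        with open(os.path.join(section_dir, name)) as f:
            parts.append(f.read().strip())
    # A blank line between sections keeps their paragraphs separate
    return "\n\n".join(parts) + "\n"
```

<p>Changing the order of the names in the list is then enough to reorder the sections of the output.</p>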
<p>But there's one problem with this approach. What if you're
writing a technical paper for publication, and the editor says,
&quot;you must submit it in Microsoft Word format&quot;? There are many ways
of converting other formats into Word, but usually the editor will
go further and say &quot;and it must look like this example&quot; or &quot;it must
use that template&quot;. This means that, after you've got the document
into Word format (by converting from HTML or whatever), you usually
have to do some further work inside Word to make it look like what
the editor wants. If you later need to make more changes, you have
two options. You can make all future changes inside Word, which
means sacrificing all the benefits of the source-code approach,
being more restricted in your choice of operating system and
working environment, and running into problems if you want to
publish in other forms and Word's conversion is not good enough.
Or you can repeat the whole process of getting the document into
Word and making the adjustments as many times as is needed.
If any of this annoys you, then that may negatively affect the
quality of your work, so it's worth doing something about it.</p>
<p>One option is to become an expert in Microsoft Word's macro
system and use that, either to automate the process of getting your
document into Word or, if you're a real macro expert, to drive the
whole approach instead of using the scripting language of your
choice. However, this does have its disadvantages. You have to
be very good with the macro system in order to make your macros
robust (it's too easy to make some small change that accidentally
means the macro won't work any more), and if your only reason for
using Word is to fulfill the requirements of some editor, why
bother to learn Microsoft's product-specific language if there's
some way of doing the same job using Python or whatever other
general scripting language you're already skilled at?</p>
<p>The approach that I eventually found was this. Modern versions
of Microsoft Word have an interesting way of saving documents as
HTML. Enough information is encoded in the HTML for Word to re-read
the document exactly in nearly all cases (if there is something
that cannot be saved in the HTML then Word should tell you what
that is at the time of the save). Now, there are various views
about the quality of Word's HTML in terms of Web standards and
portability, but if you regard Word's version of HTML as an
alternative file format for Word documents, there is hope for your
automatic program. Simply save the editor's example as HTML, have
your program generate HTML like that example, and import this into
Word whenever you need to produce a Word version; no post-editing
in Word should be necessary. You may even be able to use diff and
other utilities to find how the Word document has been changed by
others (if they haven't tracked the changes) by converting it back
to HTML and comparing that with what you sent.</p>
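<p>If you would rather do the comparison from a script than with the diff utility, Python's standard <tt class="literal">difflib</tt> module can produce the same kind of output. This is only a sketch; the function name <tt class="literal">html_diff</tt> and the file labels are made up for the example:</p>

```python
import difflib

def html_diff(sent_html, returned_html):
    """Return a unified diff between the HTML you sent and the HTML
    saved from the Word document that came back."""
    return "".join(difflib.unified_diff(
        sent_html.splitlines(True),       # keep line endings
        returned_html.splitlines(True),
        fromfile="sent.html",             # labels only, hypothetical names
        tofile="returned.html"))
```

<p>An empty result means that the two versions are identical line for line.</p>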
<p>I wasn't able to test enough versions of Word to work out
exactly when this feature was introduced; the HTML functionality of
Word 97 does NOT preserve all formatting, but Word 2002 does. If
you do have access to a real copy of Microsoft Word, as opposed to
something like OpenOffice.org, then it's best to use the real
Microsoft product in order to maximise your chances that the
conversion will go without a hitch, especially when converting back
to Word format. If you don't have access to Word then you might be
able to get away with asking the person who wants the Word document
to save their example as HTML and import your HTML reply, although
this requires more skill on their part especially if you have
separate image files.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e40" id="d0e40"></a>Reducing the
Clutter</h2>
</div>
<p>Word's HTML may be verbose, but it's not difficult to understand
which part does what (especially with syntax highlighting), and
it's much easier to work with Word's HTML than it is to work with
its RTF (how many text editors with special facilities for
navigating around RTF source do you know?). Modern versions of Word
store the document's styles in the HTML header, using the CLASS
attribute to assign HTML elements to styles and using stylesheet
overrides to show other formatting.</p>
<p>The first thing to realise is that the only markup that matters
for our purposes is the body section of the document. Everything up
to and including the line beginning <tt class=
"literal">&lt;body</tt> can be included without change at the start
of the output document. You can do this in Python as follows:</p>
<pre class="programlisting">
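import re

# Matches the opening &lt;body ...&gt; tag, whatever attributes it carries
body_tag = re.compile(r&quot;&lt;body[^&gt;]*&gt;&quot;, re.IGNORECASE)

template = open(template_html_filename).read()
text = open(my_html_filename).read()
out = open(output_filename, &quot;w&quot;)
# search() rather than match(): the tag is not at the start of the file
out.write(template[:body_tag.search(template).end()])
out.write(text[body_tag.search(text).end():])
out.close()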
</pre>
<p>which means you can edit (or automatically generate) your HTML
without worrying about what comes before the <tt class=
"literal">&lt;body&gt;</tt> tag. You could adapt this so as to
paste in text up to any point in the document you like by inserting
special keywords to indicate that point in both the template and
your document; this might be useful if the editor requires a
complex header (name, address, etc, in an unusual format) that is
hard to convert but that will rarely (if ever) have to change
during the editing process.</p>
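<p>The keyword variation might look something like the sketch below. The marker string and the function name <tt class="literal">splice_at_marker</tt> are arbitrary choices for this example; the only requirement is that the same keyword appears once in the template and once in your own document:</p>

```python
def splice_at_marker(template, text, marker="@@BODY-STARTS-HERE@@"):
    """Keep the template up to `marker`, then append everything that
    follows `marker` in the generated text. The marker itself is
    dropped so that it never shows up in Word."""
    t = template.index(marker)             # raises ValueError if absent
    d = text.index(marker) + len(marker)
    return template[:t] + text[d:]
```

<p>Because <tt class="literal">index</tt> raises an exception when the keyword is missing, a forgotten marker is noticed immediately rather than producing a truncated document.</p>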
<p>The remaining &quot;clutter&quot; in the template may include
backward-compatibility markup (you can safely replace the regular
expression <tt class="literal">&lt;!\[if
!support.*!\[endif\]&gt;</tt> with nothing), and markup for smart
tags and spelling and grammar alerts (you can also safely remove
this, but it's easier if you can tell Word not to save it in the
first place by unchecking &quot;embed smart tags&quot; and &quot;embed linguistic
data&quot; in the &quot;save options&quot; dialogue and turning off spelling and
grammar check in the &quot;language&quot; dialogue).</p>
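<p>If you do the removal of the backward-compatibility markup in Python, it is safer to make the match non-greedy, since a greedy <tt class="literal">.*</tt> would swallow everything between the first opening and the last closing marker on a line. The following is a sketch under that assumption; the name <tt class="literal">strip_support_blocks</tt> is made up, and the use of <tt class="literal">re.DOTALL</tt> (so that a block may span several lines) is a choice made here rather than something Word requires:</p>

```python
import re

# Non-greedy ".*?" stops at the nearest <![endif]>; re.DOTALL lets a
# block span several lines. Both are choices made for this sketch.
SUPPORT_BLOCK = re.compile(r"<!\[if !support.*?<!\[endif\]>", re.DOTALL)

def strip_support_blocks(html):
    """Remove Word's <![if !support...]> ... <![endif]> blocks."""
    return SUPPORT_BLOCK.sub("", html)
```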
<p>The template you have to work with is still a little too
cluttered though. There is usually a gratuitous amount of markup to
represent the skips between paragraphs, and in some documents you
will find rather a lot of markup that overrides the language of
each part of the text. If you use search-and-replace to simplify
this, you can then edit (or automatically generate) the simplified
version and then reverse all your searches and replaces to get back
to Word's version. That can all be scripted.</p>
<p>It may help to work on text that has no line breaks inside
paragraphs, only a blank line between one paragraph and the next.
You can get the text into that state by using code such as:</p>
<pre class="programlisting">
paragraph_token = &quot;-- PARAGRAPH&quot;+chr(0)
template = template.replace(&quot;\n\n&quot;, paragraph_token) \
.replace(&quot;\n&quot;,&quot; &quot;) .replace(paragraph_token, &quot;\n\n&quot;)
</pre>
<p>When you are writing your search-and-replace list, one thing
that helps to begin with is to see a list of the most common
markup. You can do this as follows. First, get a list of all the
tags that open elements:</p>
<pre class="programlisting">
openingElems = re.findall(&quot;&lt;[^/][^&gt;]*&gt;&quot;, template)
</pre>
<p>then generate a frequency count and print it out in order:</p>
<pre class="programlisting">
freqs = {}
for e in openingElems:
  freqs[e] = freqs.get(e, 0) + 1
# Sort as (count, tag) pairs so the most common markup is printed last
for f in sorted((count, tag) for tag, count in freqs.items()):
  print(f)
</pre>
<p>That, together with a look at the template document, should give
you enough pointers to write a quick script that converts between
your simplified HTML markup (or the markup that is generated by the
something-else-to-HTML-translator utility you are using in your
scripts) and the markup of the template. This script can be re-used
as many times as is needed during the editing process. It's best if
you put the list of things to search and replace into a list of
tuples, rather than in the code itself; that way the program can be
reversed just by reversing each tuple in the list:</p>
<pre class="programlisting">
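import re

# Each tuple pairs Word's markup with the simplified markup;
# \1 marks the text that is carried across in either direction.
searchList = [
  (r&quot;&lt;p class=Text lots-of-attributes&gt;\1&lt;/p&gt;&quot;,
      r&quot;&lt;p&gt;\1&lt;/p&gt;&quot;),
  (r&quot;&lt;p class=Bulletedlist &quot;
   r&quot;lots-of-attributes&gt;\1&lt;/p&gt;&quot;,
      r&quot;&lt;ul&gt;&lt;li&gt;\1&lt;/li&gt;&lt;/ul&gt;&quot;),
  # ...
]

def convert(text, source, target):
  # Escape the source side, then let \1 capture the carried text
  pattern = re.escape(source).replace(re.escape(r&quot;\1&quot;), &quot;(.*?)&quot;)
  return re.sub(pattern,
                lambda m: target.replace(r&quot;\1&quot;, m.group(1)), text)

for s in searchList: text = convert(text, s[0], s[1])
# or, to get back to Word's markup:
for s in searchList: text = convert(text, s[1], s[0])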
</pre>
<p>Note that we are not using special XML-handling toolkits because
it's often quite awkward to get them to work with Word-generated
markup; they tend to find errors and throw exceptions, which is
normally good but not what we want here. Note also the slight
awkwardness in the above example about bulleted lists: it has to
assume that each item is its own list. You can do better by writing
more code, but you might not want to (it's only for one article
after all); you can preprocess your HTML by stripping <tt class=
"literal">&lt;UL&gt;</tt>s and replacing all <tt class=
"literal">&lt;LI&gt;</tt>s with <tt class=
"literal">&lt;UL&gt;&lt;LI&gt;</tt> (unless you're also using
numbered lists, in which case you need to be more clever).</p>
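<p>The preprocessing step for bulleted lists could be sketched as follows. The function name <tt class="literal">one_list_per_item</tt> is hypothetical, and the sketch assumes that every item is closed with an explicit <tt class="literal">&lt;/li&gt;</tt> and that there are no numbered lists:</p>

```python
import re

def one_list_per_item(html):
    """Strip existing <ul> wrappers, then wrap every <li> item in a
    <ul> of its own. Assumes bulleted lists only, with each item
    closed by an explicit </li>."""
    html = re.sub(r"</?ul\b[^>]*>", "", html, flags=re.IGNORECASE)
    html = re.sub(r"<li\b[^>]*>", "<ul><li>", html, flags=re.IGNORECASE)
    return re.sub(r"</li>", "</li></ul>", html, flags=re.IGNORECASE)
```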
<p>It would be interesting to see if any readers can further
automate the process of creating this script. It would be nice if a
program could just look at the template and guess most of the rules
automatically.</p>
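<p>As a possible starting point (and no more than that), a program could group the opening paragraph tags in the template by their class and report the variant that Word used most often for each one. Everything in this sketch, including the name <tt class="literal">guess_paragraph_tags</tt>, is an assumption about how such a guesser might begin; it does not attempt to write the rules themselves:</p>

```python
import re
from collections import Counter, defaultdict

def guess_paragraph_tags(template):
    """For each paragraph class in the template, return the opening
    <p ...> tag that occurs most often, as a candidate for a rule."""
    by_class = defaultdict(Counter)
    for tag in re.findall(r"<p\b[^>]*>", template):
        m = re.search(r"""class=["']?([\w-]+)""", tag)
        if m:
            by_class[m.group(1)][tag] += 1
    return {cls: tags.most_common(1)[0][0]
            for cls, tags in by_class.items()}
```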
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e93" id="d0e93"></a>Finishing
Touches</h2>
</div>
<p>The markup for Word's equations and other special objects is
complex and is best left alone; if you need to include such things
in your document then you can do so by typesetting them in LaTeX or
whatever and including the images (hopefully not too many
Word-using editors will want to adjust your equations).
Incidentally, Word defaults to setting images at 96 dots per inch,
and it's best to use PNG or JPEG formats.</p>
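<p>The 96 dots per inch default makes it easy to work out how large to render an image so that it comes out at a given size on the page. A small sketch of the arithmetic (the helper names are invented for the example):</p>

```python
WORD_DEFAULT_DPI = 96  # the default resolution mentioned above

def pixels_for_width(width_inches, dpi=WORD_DEFAULT_DPI):
    """Pixel width an image needs in order to appear width_inches wide."""
    return int(round(width_inches * dpi))

def inches_for_pixels(width_pixels, dpi=WORD_DEFAULT_DPI):
    """Width in inches at which an image of width_pixels will be set."""
    return width_pixels / float(dpi)
```

<p>For example, an equation that should be 3 inches wide needs to be rendered 288 pixels wide.</p>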
<p>Footnotes can be awkward too, but it's best to write without
using too many footnotes anyway.</p>
<p>Clearly there are going to be some quality compromises in using
Word as an output format. It simply isn't as good at typesetting as
LaTeX is. Hopefully the journal's production editor will use
something else for the final copy. However, one thing you can do is
to make sure your quotes and em-dashes look nice. You can do this
by writing them as Unicode entities. Don't try to be too clever and
put in ligatures, unless you understand the rules of when and when
not to ligature (or can extract them from TeX), and even then don't
do it as you may find the final printed copy has them replaced with
strange-looking characters. However, nice quotes and em-dashes are
quite simple to achieve.</p>
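<p>If your generated text already contains the Unicode characters themselves, Python can turn them into numeric entities for you with the standard <tt class="literal">xmlcharrefreplace</tt> error handler. A minimal sketch, with a made-up function name:</p>

```python
def to_numeric_entities(text):
    """Replace every non-ASCII character with an HTML numeric entity,
    for example a left double quote (U+201C) becomes &#8220;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

<p>Plain ASCII text passes through unchanged, so it is safe to run this over the whole document body.</p>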
<p>The LaTeX to HTML translator TTH loses the em-dashes during its
translation (it replaces them with hyphens), so you need to replace
them with a special token before running TTH and restore them
afterwards. The following Unix commands will do this and also deal
with the quotes and ellipses:</p>
<pre class="programlisting">
mv texfile.bbl .bbl; mv texfile.aux .aux; cat     \
  texfile.tex | sed -e 's/---/SSB22ANEMDASH/g' -e \
  's/--/SSB22ANENDASH/g' | tth | sed -e           \
  's/SSB22ANEMDASH/\&amp;#8212;/g' -e                 \
  's/SSB22ANENDASH/\&amp;#8211;/g' -e                 \
  's/``/\&amp;#8220;/g' -e                            \
  &quot;s/''/\\&amp;#8221;/g&quot; -e 's/`/\&amp;#8216;/g' -e       \
  &quot;s/'/\\&amp;#8217;/g&quot; -e 's/\.\.\./\&amp;#8230;/g'      \
  &gt; texfile.html
</pre></div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e106" id="d0e106"></a>Previewing
the Result</h2>
</div>
<p>Finally, when you are developing your document, it may be nice
to be able to preview what it will look like in Word, just as it is
occasionally useful (or at least satisfying) to preview the
PostScript or DVI output when working with LaTeX. This is
particularly important if you are aiming for a certain page count.
If you are working on Windows then you can tell your script to run
Word with the HTML file on the command line. If you don't have
Word then you can download Microsoft's Word Viewer (search <a href=
"http://www.microsoft.com" target="_top">www.microsoft.com</a> for
Word Viewer), which should show your document exactly as Word
would.</p>
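<p>Launching Word from a Python script could be sketched like this. The executable name <tt class="literal">winword.exe</tt> is assumed to be reachable from your script; supply the full path if it is not on your PATH. The function names are invented for the example:</p>

```python
import subprocess

def word_command(html_path, word_exe="winword.exe"):
    """Build the command line that opens html_path in Word.
    "winword.exe" is an assumed executable name; pass the full path
    if it cannot be found."""
    return [word_exe, html_path]

def preview_in_word(html_path, word_exe="winword.exe"):
    # Popen returns immediately, so the script can carry on
    return subprocess.Popen(word_command(html_path, word_exe))
```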
<p>Sadly, the Windows emulator WINE (<a href=
"http://www.winehq.org" target="_top">www.winehq.org</a>) doesn't
seem to be up to running Word Viewer 2003 yet (it can run Word
Viewer 97, but that will not be able to render this HTML properly).
You can unpack Word Viewer 2003 with the <tt class=
"literal">cabextract</tt> utility if WINE fails to run the
installer (make sure you have the latest version of <tt class=
"literal">cabextract</tt>, as old versions tend to corrupt the
files), but when you run it, it is liable to complain about calls to
unimplemented functions in DLLs. You may be able to work around
this by borrowing DLLs from a recent version of Windows, but I
don't have a suitable Windows license to try this.</p>
<p>It is possible that the special commercial distribution of Wine
called CrossOver Office from <a href="http://www.codeweavers.com"
target="_top">www.codeweavers.com</a> will do better. However, at
the time of writing, their trial version download form seems to
have been non-functional for some time, so I couldn't check this.
Another trialware product is TextMaker from <a href=
"http://www.softmaker.com/english/tm_en.htm" target=
"_top">www.softmaker.com/english/tm_en.htm</a> which is more
lightweight than OpenOffice.org (useful if you're short on RAM) and
often gives a better idea of how Word will display your document,
but it is not perfect (it managed to scramble the references in one
of my papers) so I wouldn't buy it without checking alternatives.
OpenOffice.org 2.0 is due to be released soon, so that might be
worth checking too.</p>
</div>
</p>
</div>
</channel>
</rss>
