Lightweight XML Editing in Word 2003

By Evan Lenz
I wrote this article for the O’Reilly Network in September 2004. While it's true we now have Office 2007 and OOXML, structured authoring hasn’t changed much when it comes to Word. This article is more cutting-edge than you might expect.

Did you know that Word documents can be saved in XML format? As of Microsoft Office 2003, the second option in Word's Save As dialog--right under "Word Document (*.doc)"--is "XML Document (*.xml)". This format is Microsoft's own XML vocabulary for Word documents, called WordprocessingML (or sometimes just WordML).

Figure 1

The ability to save Word documents as XML is arguably the most important XML-related feature introduced in Word 2003. But you wouldn't know it from all the hype surrounding Word's new support for custom schemas. When Microsoft announced that Word would let you edit XML documents that conform to your own schema (not just the WordprocessingML one), we were rightly intrigued and even excited. The promise of using the world's most popular word processor to edit, say, DocBook documents was nothing less than astounding, and it caused quite a stir in the XML community.

Hope Deferred

Now that the dust has settled and Office 2003 has been available for almost a year, we've got a clearer picture of reality. While the XML features in Word, Excel, Access, and the new InfoPath application are truly impressive and useful, it's clear that Word 2003 doesn't support arbitrary XML editing. At least it doesn't line up with the picture Microsoft painted originally. For one, the custom schema functionality is available only with Office Professional or the stand-alone Word 2003. More importantly, the features don't live up to the hype. While, strictly speaking, you can edit custom XML in Word, you're limited to using schemas that have a very static, fill-in-the-blanks structure. That means no optional or repeating elements and certainly no mixed content--that is, if you want a minimally user-friendly experience.

Or you could force your users to apply XML elements to portions of their document manually, using the new XML Structure task pane with Show XML Tags turned on. In that case, yes, they could edit arbitrary XML documents, even those with mixed content. And yes, Word will let them know if they've done something invalid (though it won't stop them from doing it). But since the user has to do all the work, and since XML elements cannot be associated with style information, the experience is nowhere near user friendly (let alone WYSIWYG).

Or you could try to script in all the user friendliness by hand through the new Document Actions task pane. Of course, you should plan on joining a monastery to learn Smart Document programming and the attendant asceticism you'll need in order to appreciate the usability (or lack thereof) of your efforts' final results. (Tell me again, why are we using Word?)

Or (finally) you could come to terms with the fact that the most important (and robust) XML feature that was introduced in Word 2003 is its capability to save documents in a lossless, well-formed, open XML format called WordprocessingML. Ways to use it for generating, transforming, converting, querying, and otherwise processing Word documents are only starting to be realized. Editing custom XML may not be WordprocessingML's killer app, but it does raise some interesting possibilities that we'll explore here.

A Lightweight XSLT-Based Approach

This article presents a lightweight approach to XML editing in Word. It's "lightweight" in that it ignores all of Word's built-in custom schema functionality. A nice side effect of this approach is that it works in all editions of Word 2003. All you need outside of Word is an XSLT processor. (If you do happen to have the advanced XML functionality, you can make use of Word's bundled XSLT processor, but that's not required.)

This approach to editing will work only when your XML format is isomorphic to the structure and styles of your Word documents. The document's markup will only be as rich as the styles that are applied to it, so this rules out full-on DocBook editing. Word doesn't work well for editing recursive markup structures in general, because it doesn't support recursive styles. Each paragraph has exactly one paragraph style, and each character is associated with exactly one character style. (Word does, however, provide a convenient representation of heading levels as hierarchical subsections, using the <wx:sub-section> element, which we'll see referenced in our example below.)

You can make a complete XML editing solution for Word by writing two XSLT style sheets:

  1. A style sheet to transform from your custom XML to WordprocessingML, and
  2. A style sheet to transform from WordprocessingML back to your custom XML.

The basic scenario goes like this: to edit a custom XML document, you first transform it with style sheet No. 1 into WordprocessingML so that a user can edit it in Word. After the user is finished editing the document, the resulting WordprocessingML must be transformed again (with style sheet No. 2), back to the custom XML format.

Note: This article does not introduce WordprocessingML except by example. For more thorough coverage, refer to the Office 2003 XML sample chapter available online, called "The WordprocessingML Vocabulary".
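
For orientation, here is a minimal sketch of a WordprocessingML document: one paragraph (<w:p>) containing one run (<w:r>) of text (<w:t>). The sample text is made up for illustration; everything a real Word 2003 save would add (styles, fonts, document properties) is left out.

```xml
<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <!-- One paragraph holding a single run of unstyled text -->
    <w:p>
      <w:r>
        <w:t>Hello, WordprocessingML.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
```

The paragraph, run, and text elements shown here are the ones the template rules in this article generate and match.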

An Example

Before we look at the XSLT, here's a document that conforms to a dead-simple, DocBook-esque format that we'll be editing:

<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<?xml-stylesheet type="text/xsl" href="article2wordml.xsl"?>
<article>
  <title>This is the article title</title>
  <section>
    <title>First section</title>
    <para>This is the <emphasis>first</emphasis> paragraph.</para>
    <para>This is the <strong>second</strong> paragraph.</para>
  </section>
  <section>
    <title>Second section</title>
    <para>This section will have some sub-sections.</para>
    <section>
      <title>First sub-section</title>
      <para>This is the paragraph text of the first sub-section.</para>
    </section>
    <section>
      <title>Second sub-section</title>
      <para>This is the paragraph text of the second sub-section.</para>
      <para>And here is another paragraph, just for the fun of it--with a
            <a href="http://www.xmlportfolio.com/">hyperlink</a> to boot!</para>
    </section>
  </section>
</article>

Here is what we want this document to look like while it's being edited in Word:

Figure 2

As you can see, the XML has a few examples of mixed content, which are rendered in Word using character styles (italic, bold, and blue/underlined). The hierarchical sections of the XML document are rendered using a heading for each title (Heading 1 for the article title, and Heading 2, Heading 3, and so on for successively deeper section titles).

The Code

Assuming we have this XML document lying around already and we want to let people edit it, we'll need an XSLT style sheet to transform it to WordprocessingML (style sheet No. 1 in the list above). This file is called article2wordml.xsl. It contains various template rules that map elements from the custom XML to elements and styles defined in WordprocessingML. For example, to turn <emphasis> elements into character runs with the Emphasis character style, we use the following template rule:

<!-- For text in <emphasis>, apply the "Emphasis" character style -->
  <xsl:template match="emphasis/text()">
    <w:r>
      <w:rPr>
        <w:rStyle w:val="Emphasis"/>
      </w:rPr>
      <w:t>
        <xsl:value-of select="."/>
      </w:t>
    </w:r>
  </xsl:template>
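
The other inline elements in the sample document could be handled by the same pattern. The following is a sketch rather than an excerpt from article2wordml.xsl: the style names "Strong" and "Hyperlink" are assumptions (they would need to be defined among the document's styles), and the wrapper <xsl:stylesheet> element is there only to declare the namespaces.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- For text in <strong>, apply the (assumed) "Strong" character style -->
  <xsl:template match="strong/text()">
    <w:r>
      <w:rPr>
        <w:rStyle w:val="Strong"/>
      </w:rPr>
      <w:t>
        <xsl:value-of select="."/>
      </w:t>
    </w:r>
  </xsl:template>

  <!-- Wrap the content of <a> in a hyperlink that points at its href -->
  <xsl:template match="a">
    <w:hlink w:dest="{@href}">
      <xsl:apply-templates/>
    </w:hlink>
  </xsl:template>

  <!-- For text in <a>, apply the (assumed) "Hyperlink" character style -->
  <xsl:template match="a/text()">
    <w:r>
      <w:rPr>
        <w:rStyle w:val="Hyperlink"/>
      </w:rPr>
      <w:t>
        <xsl:value-of select="."/>
      </w:t>
    </w:r>
  </xsl:template>

</xsl:stylesheet>
```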

To turn section titles into hierarchical headings, we use this template rule:

<!-- Convert section titles to "Heading X" paragraphs -->
  <xsl:template match="section/title">
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading{count(ancestor::section)+1}"/>
      </w:pPr>
      <xsl:apply-templates/>
    </w:p>
  </xsl:template>
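
To see how the heading level is computed, take the "First sub-section" title from the sample document. It has two <section> ancestors, so count(ancestor::section)+1 yields 3. The fragment below is illustrative only: the input is shown stripped of its surrounding content, and the <w:r>/<w:t> run in the output assumes a generic text rule elsewhere in the style sheet.

```xml
<!-- Input: a title nested two <section> levels deep -->
<section>
  <section>
    <title>First sub-section</title>
  </section>
</section>

<!-- Output of the rule above for that title: the style is Heading3 -->
<w:p>
  <w:pPr>
    <w:pStyle w:val="Heading3"/>
  </w:pPr>
  <w:r>
    <w:t>First sub-section</w:t>
  </w:r>
</w:p>
```

A title directly inside a top-level section has one <section> ancestor and therefore gets Heading2, which leaves Heading1 for the article title.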

Once a user has made changes to the document from within Word, a new WordprocessingML document is saved and must be translated back to the custom XML format using style sheet No. 2 mentioned above. This style sheet, called wordml2article.xsl, has similar rules, except that they reflect the reverse mapping--from WordprocessingML to our custom XML format. For example, here's the rule that turns text in the Emphasis style into an <emphasis> element:

<!-- turn a run with the "Emphasis" character style into <emphasis> -->
  <xsl:template match="w:r[w:rPr/w:rStyle/@w:val='Emphasis']"
                mode="para-content">
    <emphasis>
      <xsl:copy-of select="w:t/text()"/>
    </emphasis>
  </xsl:template>
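
The reverse rules for the other inline styles would follow the same shape. Again, this is a sketch under assumptions rather than the contents of wordml2article.xsl: it presumes the onload style sheet used a "Strong" character style and a <w:hlink> element for links.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <!-- Turn a run with the (assumed) "Strong" character style into <strong> -->
  <xsl:template match="w:r[w:rPr/w:rStyle/@w:val='Strong']"
                mode="para-content">
    <strong>
      <xsl:copy-of select="w:t/text()"/>
    </strong>
  </xsl:template>

  <!-- Turn a hyperlink back into <a>, recovering href from w:dest -->
  <xsl:template match="w:hlink" mode="para-content">
    <a href="{@w:dest}">
      <xsl:copy-of select=".//w:t/text()"/>
    </a>
  </xsl:template>

</xsl:stylesheet>
```

The exclude-result-prefixes attribute keeps the WordprocessingML namespace declaration from being copied onto the custom XML elements in the result.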

Here are the rules that convert the Heading paragraphs back to sections with titles:

<!-- Convert <wx:sub-section> elements to <section> elements -->
  <xsl:template match="wx:sub-section">
    <section>
      <xsl:apply-templates/>
    </section>
  </xsl:template>

<!-- Convert <w:p> paragraphs to <para> paragraphs -->
  <xsl:template match="w:p">
    <para>
      <xsl:apply-templates mode="para-content"/>
    </para>
  </xsl:template>

<!-- ...except for the first paragraph in a sub-section (Heading 1,2,3,...);
       the heading will be the <title> of the section -->
  <xsl:template match="wx:sub-section/w:p[1]">
    <title>
      <xsl:apply-templates mode="para-content"/>
    </title>
  </xsl:template>

For a complete investigation of the style sheets (including descriptive comments), see the full text of these files: article2wordml.xsl and wordml2article.xsl.

Formatting Restrictions

Word 2003 also quietly introduces a new feature called formatting restrictions. When you have formatting restrictions enabled, users are restricted to using the set of styles that you specify. They can't modify the styles, nor can they apply direct formatting (such as bold or italic) to their document. While not specifically an XML feature, this enables a sort of document validation that makes particular sense when you are using the lightweight XML editing approach described above. It lets you restrict the range of formatting constructs that your conversion XSLT will have to handle. Rather than writing a generic WordprocessingML transformation, your style sheet will have to handle only those Word documents that are restricted to a particular Word template and its styles. This is a global restriction--a set of allowed styles, as opposed to a content model schema. You can't, for example, enforce that the Emphasis character style be used only in Normal paragraphs. Nevertheless, it is a profoundly useful feature for XML editing applications in Word.

If you look back in article2wordml.xsl, you'll see that formatting restrictions are enabled as a document setting:

  <w:docPr>
    ...
    <w:documentProtection w:formatting="on" w:enforcement="on"/>
  </w:docPr>

The particular styles that are locked or unlocked are indicated as such in the WordprocessingML's global <w:styles> element. In this case, we restrict the users to only the styles they see in the "Styles and Formatting" task pane:

Figure 3

The other built-in styles normally available to Word users appear as if they don't even exist anymore.

Using Word's XSLT Processor

This editing "solution" will work regardless of the edition of Word 2003 you have, provided that you have an external XSLT processor to do the transformations between edits. But if you have Office Professional or the stand-alone Word 2003, then you don't need another XSLT processor; you can use the bundled XSLT processor that comes with those editions of Word. Looking back at our article XML example, we see two processing instructions:

<?mso-application progid="Word.Document"?>
<?xml-stylesheet type="text/xsl" href="article2wordml.xsl"?>

The mso-application processing instruction (PI) associates the XML file with the Word application, so that when a user double-clicks the file, Word opens the XML file, overriding whatever the default XML viewer is on their system. The second PI is useful only if you've got the advanced XML features. Upon opening the file, the user is presented with an option to apply article2wordml.xsl to the document, yielding the editing view we saw above. This is called an onload transformation.

Our other style sheet, wordml2article.xsl, is called an onsave style sheet, as it is applied to the WordprocessingML representation of the edited Word document when the user saves the document after making changes. How does Word know to use this style sheet, you ask? It is referenced inside the WordprocessingML result of the onload transformation. If you look inside article2wordml.xsl, you'll see the relevant document properties being set like so:

      <w:docPr>
        <!-- This only works if you're using Word 2003 standalone or
             Office 2003 Professional -->
        <w:useXSLTWhenSaving/>
        <w:saveThroughXSLT w:xslt="wordml2article.xsl"/>
        ...
      </w:docPr>

The end result is that end users can open, edit, and save the custom XML file without having to invoke any external IT processes. Word handles both XSLT transformations to and from WordprocessingML.

Some Benefits

This approach treats XML editing as essentially a conversion problem. While the activity of conversion isn't the same as that of editing, they're related. If you can create a reasonably reliable transformation from a legacy document format to a desired XML format, then it stands to reason that you could use the same transformation for new Word documents that users create.

A few things can make this easier for the scenario in which authors are creating new documents, as opposed to you converting legacy documents. Before users start authoring documents, you have the freedom to decide what Word template to use, along with the appropriate styles--whereas you don't have that option when converting legacy documents that already exist.

Another advantage of this approach is that it doesn't force the Word user to adopt a new model or way of thinking or editing (which is decidedly not the case if you make them use Word's built-in custom XML features). The savvy Word author doesn't have to know that the document will be converted to XML later on. They just know that using styles is good practice. But even if they don't know that, we can force them (through formatting restrictions) to use the correct styles to get the formatting they want.

Some Limitations

One of the things I like about this "lightweight" approach is that, beyond creating a Word template, the only code you have to write is two XSLT style sheets. It sounds deceptively simple. The problem is that the more complicated your XML formats become, the more difficult it will be to define round-trip mappings between them and WordprocessingML. In the real world, we usually want to support at least some forms of recursive markup. For example, we should be able to specify that some text is "strong" and "emphasized" by using markup like this:

  <strong>This is bold <emphasis>and italic</emphasis>.</strong>

But since Word allows only one character style on any given run of text, it can't represent such combinations directly; you have to merge these into a single style definition, called something like StrongAndEmphasis. And you'll want to also account for the scenario in which a <strong> element appears inside an <emphasis> element, not just the other way around. So we would need to add a rule to our onload style sheet that looks something like this:

  <xsl:template match="strong/emphasis/text() | emphasis/strong/text()"
                priority="1">
    <w:r>
      <w:rPr>
        <w:rStyle w:val="StrongAndEmphasis"/>
      </w:rPr>
      <w:t>
        <xsl:value-of select="."/>
      </w:t>
    </w:r>
  </xsl:template>

The transformation back to the custom XML format is even trickier if we want to avoid flattened markup that looks like this in the result:

  <strong>This is bold </strong>
  <strong><emphasis>and italic</emphasis></strong>
  <strong>.</strong>

That's not to say that your average XSLT wizard won't be able to figure out a solution--maybe even a generic solution. (I can imagine using a two-stage transformation that would allow you to reintroduce a normalized hierarchy into the markup, but that's getting out of scope here.) It's just that it won't be terribly straightforward. Even so, I like the challenge.
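
As a rough idea of what such a second stage might look like, here is one possible XSLT 1.0 sketch; it is not part of the style sheets described in this article. It assumes the first stage has already produced the flattened markup, with the <strong> elements immediately adjacent (no whitespace between them), and it merges only <strong>. The mirror-image case, where <emphasis> is outermost, would need its own analogous rules.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy everything that isn't handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- The first <strong> in a run of adjacent <strong> siblings
       emits a single merged <strong> for the whole run -->
  <xsl:template
      match="strong[not(preceding-sibling::node()[1][self::strong])]">
    <strong>
      <xsl:apply-templates select="." mode="merge"/>
    </strong>
  </xsl:template>

  <!-- A <strong> that directly follows another <strong> was already
       absorbed by the rule above, so it produces nothing here -->
  <xsl:template
      match="strong[preceding-sibling::node()[1][self::strong]]"/>

  <!-- Copy this element's children, then continue with the next
       sibling if it is also a <strong> -->
  <xsl:template match="strong" mode="merge">
    <xsl:apply-templates select="node()"/>
    <xsl:apply-templates
        select="following-sibling::node()[1][self::strong]"
        mode="merge"/>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the three adjacent <strong> elements shown above, this should yield the single nested <strong> element we started with. The recursion walks one sibling at a time, which keeps the logic within XSLT 1.0 and avoids the need for any grouping extensions.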

Conclusion

The takeaway from this article should not be that Word's custom XML schema features are completely useless. No, they have their uses, particularly if you've got more data-oriented, business-template document formats. The thing to keep in mind is that this is essentially version 1.0 technology. It is exciting, even if it's not ready for prime time in terms of general XML editing. It will definitely be interesting to see what the next version of Word will add in terms of XML support. Until then, you might still be able to employ Word in a robust and usable way for your document-oriented XML applications with a little bit of creativity and XSLT trickery.
