How to solve the document generation problem

Serengeti

Tech

02.09.2019.

Microsoft Word is an office tool for editing documents. Besides text, it supports advanced features like inserting tables, pictures, different text colours and fonts.

Word and similar tools become less useful when you need to fill one template file with many different data sets. For example, you may need many documents where the only difference is the forename and surname. Doing this by hand is tedious and time-consuming.

Docx document structure

Docx, pptx, xlsx and other Office formats are actually just zip files with Office Open XML (OOXML) documents inside (ECMA-376, ISO/IEC 29500). OOXML is similar across all Office formats; they differ in some XML tags and tag attributes. Spreadsheet and document markup differ mostly in structure, but the vector drawing markup (DrawingML) can be exchanged between formats without changes.
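Because a .docx is a zip file, any zip library can look inside it. The snippet below is a minimal sketch in Python (standard library only, not the OpenXML SDK discussed later). It builds a tiny in-memory package with typical part names and placeholder content, then lists the parts. The helper names are made up for this illustration, and the placeholder package is not a document Word would open.

```python
import io
import zipfile


def build_placeholder_package() -> bytes:
    """Build an in-memory zip that uses typical .docx part names.

    The XML bodies are placeholders, so this is NOT a valid Word document.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/_rels/document.xml.rels", "<Relationships/>")
    return buffer.getvalue()


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the names of all parts stored in the zip package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


parts = list_parts(build_placeholder_package())
for name in parts:
    print(name)
```

Running `list_parts` on the bytes of a real .docx file should show the same kind of layout, usually with more parts such as styles and media.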

The _rels/ folder contains the relationships used in a document. For example, the relationship definitions for TEST.xml are stored in _rels/TEST.xml.rels.

Relationship files basically link IDs to data. Images are referenced in the document by ID, which avoids duplicating raw data and cluttering the XML with embedded content.
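To make the ID-to-data link concrete, here is a small Python sketch that reads a relationships part and builds a lookup from ID to target. The sample XML, the IDs and the target paths are invented for this illustration; only the namespace comes from the Open Packaging Conventions.

```python
import xml.etree.ElementTree as ET

# Namespace used by relationship parts in OOXML packages.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hypothetical content of word/_rels/document.xml.rels.
SAMPLE_RELS = """<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
      Target="media/image1.png"/>
  <Relationship Id="rId2"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
      Target="styles.xml"/>
</Relationships>"""


def read_relationships(rels_xml: str) -> dict[str, str]:
    """Map each relationship Id to the Target it points at."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{RELS_NS}}}Relationship")
    }


relationships = read_relationships(SAMPLE_RELS)
print(relationships["rId1"])
```

The document body then only needs to mention `rId1`; the image bytes live once, in the part that the target names.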

word/document.xml is the main part of a document. It contains the whole layout and content, except footers and headers.

word/document.xml example

Most document tags contain properties and content. The <w:r> tag (text run) contains text under a child <w:t> tag and properties under a <w:rPr> (run properties) tag. Text can be split across multiple run tags, which makes it harder to find template names and replace them with actual values.
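The following Python sketch shows why split runs are a problem. The paragraph is invented for this illustration: an editor has split one template name across three runs, so no single run contains it, and only the joined text does.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# Hypothetical paragraph in which "$(Forename)" is split across three runs.
SAMPLE_PARAGRAPH = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:rPr><w:b/></w:rPr><w:t>Dear $(</w:t></w:r>
  <w:r><w:t>Fore</w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>name)</w:t></w:r>
</w:p>"""


def run_texts(paragraph: ET.Element) -> list[str]:
    """Return the text of every run in the paragraph, in document order."""
    return [
        "".join(t.text or "" for t in run.iter(f"{W}t"))
        for run in paragraph.iter(f"{W}r")
    ]


paragraph = ET.fromstring(SAMPLE_PARAGRAPH)
pieces = run_texts(paragraph)
full_text = "".join(pieces)

token = "$(Forename)"
print(any(token in piece for piece in pieces))  # no single run holds the token
print(token in full_text)                       # the joined text does
```

Finding the token in the joined text is easy; writing the replacement back is the hard part, because the three runs may carry different formatting.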

The example below contains a single $(Value) text element, but the editing software also inserted a bookmark tag, which in this case represents the cursor position when the document was saved.

single text element

OpenXML SDK

OpenXML SDK is a library for reading, editing and writing Microsoft Office Open XML formats from .NET languages. The library does not offer any editing abstractions. It just maps XML tags to .NET classes and XML tag attributes to .NET class properties. File reading and writing are handled automatically.

For example, the <w:p> tag is represented as DocumentFormat.OpenXml.Wordprocessing.Paragraph.

Generating documents that display correctly in different office tools (Microsoft Word, WPS Office, ...) is difficult because the library does not check for errors while you edit; it only provides raw data editing.

On top of that, some tags can be mapped to different OpenXML classes where the only difference is the set of available class attributes. As a result, less strict tools like WPS Office display the generated documents correctly, while Microsoft Word simply refuses to open the file.

The Wordprocessing.SdtContentRun and Wordprocessing.SdtContentBlock classes are both represented as the <w:sdtContent> XML tag. Which one is correct depends on the parent tag.
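The parent-tag rule can be sketched in a few lines of Python. This is a simplification under stated assumptions: it only distinguishes a content control placed directly in the body (block level) from one placed inside a paragraph (run level), and it reports everything else as "other". The sample XML and the function name are invented for this illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# Hypothetical body with one block-level and one inline content control.
SAMPLE_BODY = f"""<w:body xmlns:w="{W_NS}">
  <w:sdt><w:sdtContent><w:p><w:r><w:t>block</w:t></w:r></w:p></w:sdtContent></w:sdt>
  <w:p>
    <w:sdt><w:sdtContent><w:r><w:t>inline</w:t></w:r></w:sdtContent></w:sdt>
  </w:p>
</w:body>"""


def classify_sdt_content(root: ET.Element) -> list[tuple[str, str]]:
    """Pair each control's text with the SDK class its content would need."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    result = []
    for sdt in root.iter(f"{W}sdt"):
        parent = parent_of.get(sdt)
        parent_tag = parent.tag if parent is not None else None
        if parent_tag == f"{W}p":
            kind = "SdtContentRun"    # control sits among the runs of a paragraph
        elif parent_tag == f"{W}body":
            kind = "SdtContentBlock"  # control sits among the paragraphs of the body
        else:
            kind = "other"            # table rows and cells are not covered here
        text = "".join(t.text or "" for t in sdt.iter(f"{W}t"))
        result.append((text, kind))
    return result


classified = classify_sdt_content(ET.fromstring(SAMPLE_BODY))
print(classified)
```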

Document editing/generation example

The template processing service requires a Plain Text Content Control. It checks the text inside for a valid template token structure, finds the matching token value and replaces the whole control with text, an image, a table or whatever else was specified as the value.

A Plain Text Content Control is used because, unlike normal paragraph tags, it does not split the text. This simplifies searching for template tags.
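A rough Python sketch of the search step is shown below. It assumes the token format is `$(Name)`, based on the `$(Value)` example earlier; the actual rules of the service are not described in this article. The sample XML, the regular expression and the function name are invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# Assumed token format: "$(" + name + ")" with nothing else in the control.
TOKEN_RE = re.compile(r"^\$\((\w+)\)$")

# Hypothetical paragraph with two plain text content controls.
SAMPLE = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:t>Dear </w:t></w:r>
  <w:sdt>
    <w:sdtPr><w:text/></w:sdtPr>
    <w:sdtContent><w:r><w:t>$(Forename)</w:t></w:r></w:sdtContent>
  </w:sdt>
  <w:sdt>
    <w:sdtPr><w:text/></w:sdtPr>
    <w:sdtContent><w:r><w:t>not a token</w:t></w:r></w:sdtContent>
  </w:sdt>
</w:p>"""


def find_token_names(root: ET.Element) -> list[str]:
    """Return the token names found in content controls, in document order."""
    names = []
    for sdt in root.iter(f"{W}sdt"):
        content = sdt.find(f"{W}sdtContent")
        if content is None:
            continue
        text = "".join(t.text or "" for t in content.iter(f"{W}t")).strip()
        match = TOKEN_RE.match(text)
        if match:  # controls holding ordinary text are ignored
            names.append(match.group(1))
    return names


token_names = find_token_names(ET.fromstring(SAMPLE))
print(token_names)
```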

Plain Text Control

Templates and template values are sent to the processing service as raw files encoded in Base64. Templates can also be stored elsewhere and referenced by HTTP(S) URLs or SharePoint links.
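The transport step could look roughly like the Python sketch below. The request shape is an assumption: the field names `template` and `values` and both helper functions are hypothetical and do not describe the real service contract.

```python
import base64
import json


def build_request(template_bytes: bytes, values: dict) -> str:
    """Pack a template and its values into a JSON body (hypothetical shape)."""
    values_bytes = json.dumps(values).encode("utf-8")
    payload = {
        # Field names are made up for this sketch.
        "template": base64.b64encode(template_bytes).decode("ascii"),
        "values": base64.b64encode(values_bytes).decode("ascii"),
    }
    return json.dumps(payload)


def read_request(body: str) -> tuple[bytes, dict]:
    """Reverse build_request: decode both files from the JSON body."""
    payload = json.loads(body)
    template_bytes = base64.b64decode(payload["template"])
    values = json.loads(base64.b64decode(payload["values"]))
    return template_bytes, values


# Stand-in bytes; a real request would carry the bytes of Template.docx.
request_body = build_request(b"PK\x03\x04 placeholder", {"Forename": "Ana"})
decoded_template, decoded_values = read_request(request_body)
print(decoded_values)
```

Base64 keeps the binary .docx intact inside a text format such as JSON, at the cost of roughly a third more bytes on the wire.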

template values

The service opens Template.docx, recursively searches the XML tags for template tags in the valid format and replaces them with values from data.json. It saves the result and sends it back to the client as another Base64-encoded file.
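The whole flow can be approximated in Python with the standard library, as in the sketch below. It is not the service's implementation (which uses the OpenXML SDK). It assumes `$(Name)` tokens, handles only inline content controls whose parent is a paragraph, and only inserts plain text. All names and the sample document are invented for this illustration. ElementTree may also rewrite namespace prefixes when it saves, so a real template would likely need more care than this.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
ET.register_namespace("w", W_NS)  # keep the familiar "w:" prefix on output
TOKEN_RE = re.compile(r"^\$\((\w+)\)$")  # assumed token format


def fill_document_xml(xml_bytes: bytes, values: dict) -> bytes:
    """Replace inline content controls holding $(Name) with a plain text run."""
    root = ET.fromstring(xml_bytes)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for sdt in list(root.iter(f"{W}sdt")):
        parent = parent_of.get(sdt)
        if parent is None or parent.tag != f"{W}p":
            continue  # sketch limitation: block-level controls are left alone
        text = "".join(t.text or "" for t in sdt.iter(f"{W}t")).strip()
        match = TOKEN_RE.match(text)
        if not match or match.group(1) not in values:
            continue
        run = ET.Element(f"{W}r")
        ET.SubElement(run, f"{W}t").text = str(values[match.group(1)])
        run.tail = sdt.tail
        index = list(parent).index(sdt)
        parent.remove(sdt)  # drop the whole control...
        parent.insert(index, run)  # ...and put the value in its place
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def fill_template(docx_bytes: bytes, values: dict) -> bytes:
    """Copy every part of the package, rewriting only word/document.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = fill_document_xml(data, values)
            dst.writestr(item.filename, data)
    return out.getvalue()


# Hypothetical stand-in for Template.docx with a single part.
SAMPLE_DOCUMENT = f"""<w:document xmlns:w="{W_NS}"><w:body><w:p>
<w:r><w:t xml:space="preserve">Dear </w:t></w:r>
<w:sdt><w:sdtPr><w:text/></w:sdtPr>
<w:sdtContent><w:r><w:t>$(Forename)</w:t></w:r></w:sdtContent></w:sdt>
</w:p></w:body></w:document>"""

template = io.BytesIO()
with zipfile.ZipFile(template, "w") as package:
    package.writestr("word/document.xml", SAMPLE_DOCUMENT)

# The dict plays the role of data.json.
filled = fill_template(template.getvalue(), {"Forename": "Ana"})
with zipfile.ZipFile(io.BytesIO(filled)) as package:
    filled_root = ET.fromstring(package.read("word/document.xml"))
filled_text = "".join(t.text or "" for t in filled_root.iter(f"{W}t"))
print(filled_text)
```

Replacing the whole control rather than editing its text mirrors the behaviour described above, and it leaves no empty control behind in the generated document.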

data.json values
data.json

Saša Barišić, Junior Developer