OpenDocument Format and the ODFDOM API

In this series of articles, you will learn how to use the ODFDOM Toolkit API to create documents in OpenDocument format (ODF) and extract information from ODF files.

Inside an OpenDocument File

Before looking at the toolkit, let’s talk about OpenDocument format. The best way to do this is to look at a sample word processing file (see a screenshot). At its heart, an OpenDocument file is a .zip file that contains XML files that describe the document. If you unzip the sample document, you’ll get about a dozen files and directories. Here are the most important ones, and how they relate to your document.
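
If you would rather look inside the package from Java than with an unzip utility, the JDK's own zip classes are enough. The following is a minimal sketch, not part of the ODFDOM Toolkit: the class name OdfPackageLister and the in-memory stand-in package are assumptions made for illustration, and the entry names are placeholders rather than a listing of the article's sample file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Lists the entries of an ODF package using only the JDK's zip support. */
final class OdfPackageLister {

    /** Returns the entry names in the order they are stored in the zip stream. */
    static List<String> listEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    /** Builds a tiny stand-in package in memory so the demo needs no real file. */
    static byte[] samplePackage() {
        String[] entryNames = {
            "mimetype", "content.xml", "styles.xml", "meta.xml", "META-INF/manifest.xml"
        };
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("placeholder".getBytes(StandardCharsets.UTF_8)); // dummy content
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        listEntries(new ByteArrayInputStream(samplePackage())).forEach(System.out::println);
    }
}
```

To try it on a real document, pass a java.io.FileInputStream opened on your .odt file to listEntries instead of the stand-in package.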

The META-INF/manifest.xml File

The manifest.xml file is the “table of contents” of the document package. It contains a list of all the directories and files that make up the ODF document, not including itself. The ODF toolkit builds the manifest file for you automatically when you create or modify a document.
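
To make the "table of contents" idea concrete, here is a rough sketch that reads the entries of a manifest with the JDK's DOM parser. The manifest shown is a hypothetical, shortened one written for this example, not the manifest of the sample document, and the class name ManifestReader is likewise an assumption. Each file is assumed to be described by a manifest:file-entry element with manifest:full-path and manifest:media-type attributes.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Reads the file entries out of a META-INF/manifest.xml document. */
final class ManifestReader {

    static final String MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0";

    // Hypothetical, shortened manifest for a text document (illustration only).
    static final String SAMPLE =
        "<manifest:manifest xmlns:manifest='" + MANIFEST_NS + "'>"
        + "<manifest:file-entry manifest:full-path='/'"
        + " manifest:media-type='application/vnd.oasis.opendocument.text'/>"
        + "<manifest:file-entry manifest:full-path='content.xml' manifest:media-type='text/xml'/>"
        + "<manifest:file-entry manifest:full-path='styles.xml' manifest:media-type='text/xml'/>"
        + "<manifest:file-entry manifest:full-path='meta.xml' manifest:media-type='text/xml'/>"
        + "</manifest:manifest>";

    /** Maps each entry's full path to its media type, in document order. */
    static Map<String, String> fileEntries(String manifestXml) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace-based lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(manifestXml)));
            NodeList entries = doc.getElementsByTagNameNS(MANIFEST_NS, "file-entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                result.put(entry.getAttributeNS(MANIFEST_NS, "full-path"),
                           entry.getAttributeNS(MANIFEST_NS, "media-type"));
            }
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse manifest", ex);
        }
        return result;
    }

    public static void main(String[] args) {
        fileEntries(SAMPLE).forEach((path, type) -> System.out.println(path + " -> " + type));
    }
}
```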

The meta.xml File

The meta.xml file contains metadata about the document, such as its creation date, author, and description. For example, the information shown in this dialog box is reflected in this XML:

[Figure: dialog box showing title, subject, and keywords]

<dc:title>Raccoons</dc:title>
<dc:subject>Nature Notes about Raccoons</dc:subject>
<meta:keyword>raccoon</meta:keyword>
<meta:keyword>mammal</meta:keyword>
<meta:keyword>procyonidae</meta:keyword>
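
As a sketch of how you might pull the keywords back out of this metadata without the toolkit, the following uses the JDK's streaming XML reader. The wrapper elements and namespace declarations around the fragment are an assumption added so that the XML parses on its own, and the class name MetaKeywords is made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Collects the meta:keyword values from a meta.xml document. */
final class MetaKeywords {

    static final String META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0";

    // The article's fragment, wrapped in assumed root elements so it is well-formed.
    static final String SAMPLE =
        "<office:document-meta"
        + " xmlns:office='urn:oasis:names:tc:opendocument:xmlns:office:1.0'"
        + " xmlns:dc='http://purl.org/dc/elements/1.1/'"
        + " xmlns:meta='" + META_NS + "'>"
        + "<office:meta>"
        + "<dc:title>Raccoons</dc:title>"
        + "<dc:subject>Nature Notes about Raccoons</dc:subject>"
        + "<meta:keyword>raccoon</meta:keyword>"
        + "<meta:keyword>mammal</meta:keyword>"
        + "<meta:keyword>procyonidae</meta:keyword>"
        + "</office:meta></office:document-meta>";

    /** Returns the text of every meta:keyword element, in document order. */
    static List<String> keywords(String metaXml) {
        List<String> found = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(metaXml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && META_NS.equals(reader.getNamespaceURI())
                        && "keyword".equals(reader.getLocalName())) {
                    found.add(reader.getElementText()); // leaves the reader on the end tag
                }
            }
            reader.close();
        } catch (XMLStreamException ex) {
            throw new IllegalArgumentException("Could not read meta.xml", ex);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(keywords(SAMPLE));
    }
}
```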

The styles.xml File

The styles.xml file contains information about named styles: the ones you see in a dialog box in your word processor. The following screenshot shows this dialog along with the two headings from the sample document, which use the Heading 1 and Lined Title styles.

[Figure: pull-down dialog with “Lined Title” highlighted]
[Figure: section of the document showing the two headings]

Here is the XML for the Lined Title style, edited for brevity:

<style:style style:family="paragraph" style:name="Lined_20_Title"
    style:display-name="Lined Title" style:parent-style-name="Standard"
    style:class="text">
  <style:paragraph-properties fo:text-align="end"
      fo:border-bottom="0.0008in solid #000000"/>
  <style:text-properties style:font-name="Arial"
      fo:font-size="14pt" fo:font-style="italic"
      style:font-size-asian="14pt" style:font-style-asian="italic"
      style:font-size-complex="14pt" style:font-style-complex="italic"/>
</style:style>

Every style belongs to a family. The family tells you what kind of element the style applies to. Styles for paragraphs or headings belong to the paragraph family; styles for inline text belong to the text family.

The style also has a style:name attribute for internal use and a style:display-name attribute to show to the user. The convention is to change blanks in the display name to _20_ in the internal style name.
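
The naming convention is simple enough to sketch in a few lines. The helper below is hypothetical (it is not an ODFDOM method) and handles only blanks, which is the case described above; other special characters are deliberately left alone.

```java
/** Converts between ODF display names and internal style names (blanks only). */
final class StyleNames {

    /** "Lined Title" becomes "Lined_20_Title"; 20 is the hexadecimal code for a blank. */
    static String toInternalName(String displayName) {
        return displayName.replace(" ", "_20_");
    }

    /** Reverses toInternalName: "Heading_20_1" becomes "Heading 1". */
    static String toDisplayName(String internalName) {
        return internalName.replace("_20_", " ");
    }

    public static void main(String[] args) {
        System.out.println(toInternalName("Lined Title"));
        System.out.println(toDisplayName("Heading_20_1"));
    }
}
```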

Style Properties

Within the style object are the style properties that describe what the style looks like. These properties come in property sets, and a style can have properties from more than one set. The Lined Title style has paragraph properties to set the text alignment and borders; its text properties, which apply to individual characters, specify the font name, font size, and font style.

Automatic Styles

The styles.xml file also contains automatic styles. These styles have an internal style name, but are not accessible to users by name. For example, when you insert the date field at the lower right of the page, its style is automatically created with the rather prosaic name of N37; here is its XML.

<number:date-style style:name="N37" number:automatic-order="true">
  <number:month number:style="long"/>
  <number:text>/</number:text>
  <number:day number:style="long"/>
  <number:text>/</number:text>
  <number:year/>
</number:date-style>
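
If it helps to think of N37 in Java terms, the sketch below shows a roughly equivalent java.time pattern. The mapping is an assumption on my part: a "long" month or day is read as two digits, and a year with no number:style attribute is read as the short, two-digit form. It also ignores number:automatic-order, which allows an application to reorder the parts for the user's locale, so treat it as an approximation and not as what any particular word processor will display.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Approximates the N37 date style with a java.time pattern (assumed mapping). */
final class DateStyleN37 {

    // month (long) "/" day (long) "/" year (no style, assumed two digits)
    static final DateTimeFormatter N37 = DateTimeFormatter.ofPattern("MM/dd/yy", Locale.ROOT);

    static String format(LocalDate date) {
        return N37.format(date);
    }

    public static void main(String[] args) {
        System.out.println(format(LocalDate.of(2010, 3, 7))); // prints 03/07/10
    }
}
```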

The style for the page layout is also in the automatic styles.

The content.xml File

The text of those two headings (and the blank line between them) goes into the content.xml file with the following XML; <text:h> specifies a heading and <text:p> a paragraph.

<text:h text:style-name="Heading_20_1" text:outline-level="1">Raccoons</text:h>
<text:p text:style-name="Standard"/>
<text:p text:style-name="Lined_20_Title">Nature Notes № 1</text:p>
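
Here is a small sketch of how a program might walk these elements with the JDK's DOM parser and report each heading or paragraph along with its style name. The enclosing office:document-content, office:body, and office:text elements are assumed wrappers added so the fragment parses on its own, and the class name ContentOutline is invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Lists the headings and paragraphs found directly inside office:text. */
final class ContentOutline {

    static final String OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // The article's fragment inside assumed wrapper elements; \u2116 is the numero sign.
    static final String SAMPLE =
        "<office:document-content xmlns:office='" + OFFICE_NS + "' xmlns:text='" + TEXT_NS + "'>"
        + "<office:body><office:text>"
        + "<text:h text:style-name='Heading_20_1' text:outline-level='1'>Raccoons</text:h>"
        + "<text:p text:style-name='Standard'/>"
        + "<text:p text:style-name='Lined_20_Title'>Nature Notes \u2116 1</text:p>"
        + "</office:text></office:body></office:document-content>";

    /** Returns one line per text:h or text:p element: kind, style name, and text. */
    static List<String> outline(String contentXml) {
        List<String> lines = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(contentXml)));
            Node body = doc.getElementsByTagNameNS(OFFICE_NS, "text").item(0);
            for (Node n = (body == null ? null : body.getFirstChild()); n != null;
                    n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE || !TEXT_NS.equals(n.getNamespaceURI())) {
                    continue; // skip whitespace and elements from other namespaces
                }
                String kind = n.getLocalName();
                if (kind.equals("h") || kind.equals("p")) {
                    String style = ((Element) n).getAttributeNS(TEXT_NS, "style-name");
                    lines.add(String.format("%s [%s] '%s'", kind, style, n.getTextContent()));
                }
            }
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse content.xml", ex);
        }
        return lines;
    }

    public static void main(String[] args) {
        outline(SAMPLE).forEach(System.out::println);
    }
}
```

The empty paragraph shows up as a line with empty text, which matches the blank line between the two headings.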

Styles in Content

When you click the “bold” or “italic” icons in the word processor, the program will produce an automatic style for the selected text. These automatic styles go into the content.xml file; they have the same format as the styles in the styles.xml file. Here is the relevant XML for the styles and the text that uses them:

[Figure: text with bold and italic words]

<style:style style:name="T2" style:family="text">
  <style:text-properties fo:font-weight="bold"/>
</style:style>
<style:style style:name="T4" style:family="text">
  <style:text-properties fo:font-style="italic"/>
</style:style>

<text:p text:style-name="Standard">Raccoons, those mysterious masked mammals, are members of the <text:span text:style-name="T2">procyonidae</text:span> family. The scientific name for the raccoon is <text:span text:style-name="T4">Procyon lotor</text:span></text:p>
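
The link between a span and its automatic style is just a name lookup. The following sketch resolves each text:span to the bold or italic setting of the style it names. It assumes the styles sit in an office:automatic-styles element of the same document, looks only at the two properties used in this example, and uses an invented class name, SpanStyles.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Resolves the automatic style referenced by each text:span. */
final class SpanStyles {

    static final String OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
    static final String STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0";
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0";

    // The article's fragments placed inside assumed wrapper elements.
    static final String SAMPLE =
        "<office:document-content xmlns:office='" + OFFICE_NS + "' xmlns:style='" + STYLE_NS
        + "' xmlns:text='" + TEXT_NS + "' xmlns:fo='" + FO_NS + "'>"
        + "<office:automatic-styles>"
        + "<style:style style:name='T2' style:family='text'>"
        + "<style:text-properties fo:font-weight='bold'/></style:style>"
        + "<style:style style:name='T4' style:family='text'>"
        + "<style:text-properties fo:font-style='italic'/></style:style>"
        + "</office:automatic-styles><office:body><office:text>"
        + "<text:p text:style-name='Standard'>Raccoons, those mysterious masked mammals, "
        + "are members of the <text:span text:style-name='T2'>procyonidae</text:span> family. "
        + "The scientific name for the raccoon is "
        + "<text:span text:style-name='T4'>Procyon lotor</text:span></text:p>"
        + "</office:text></office:body></office:document-content>";

    /** Maps the text of each span to the formatting of the style it references. */
    static Map<String, String> spanFormatting(String contentXml) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(contentXml)));

            // First pass: style name -> "bold", "italic", or both.
            Map<String, String> formatting = new HashMap<>();
            NodeList styles = doc.getElementsByTagNameNS(STYLE_NS, "style");
            for (int i = 0; i < styles.getLength(); i++) {
                Element style = (Element) styles.item(i);
                NodeList props = style.getElementsByTagNameNS(STYLE_NS, "text-properties");
                if (props.getLength() == 0) {
                    continue;
                }
                Element textProps = (Element) props.item(0);
                String weight = textProps.getAttributeNS(FO_NS, "font-weight"); // "" if absent
                String slant = textProps.getAttributeNS(FO_NS, "font-style");   // "" if absent
                formatting.put(style.getAttributeNS(STYLE_NS, "name"),
                               (weight + " " + slant).trim());
            }

            // Second pass: look up the style named by each span.
            NodeList spans = doc.getElementsByTagNameNS(TEXT_NS, "span");
            for (int i = 0; i < spans.getLength(); i++) {
                Element span = (Element) spans.item(i);
                String name = span.getAttributeNS(TEXT_NS, "style-name");
                result.put(span.getTextContent(),
                           formatting.getOrDefault(name, "unknown style " + name));
            }
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse content.xml", ex);
        }
        return result;
    }

    public static void main(String[] args) {
        spanFormatting(SAMPLE).forEach((text, fmt) -> System.out.println(text + " -> " + fmt));
    }
}
```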

Images in Documents

Images are referred to by the <draw:image> element, and the image itself is placed in the Pictures folder. The ODF Toolkit lets you insert both the image and the XML with a call to a single method.

Tables

Tables are specified in the content file with a <table:table> element. The table starts with <table:table-column> elements that specify each column’s style. The column specifications are followed by <table:table-row> elements. Each row contains <table:table-cell> elements, which in turn contain the cell content. In the case of this table, each cell contains a <text:p> element for the number of raccoon sightings.

[Figure: table showing raccoon sightings per month]
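
To see how that nesting translates into data, here is a sketch that reads a table into a list of rows with the JDK's DOM parser. The table in the example is hypothetical: its name, style names, and numbers are made up for illustration and are not the values from the sample document. The class name TableReader is also an assumption, and the sketch ignores details such as repeated or merged cells.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Reads the cell text of every table row in a content document. */
final class TableReader {

    static final String TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Hypothetical two-column table; the names and numbers are invented.
    static final String SAMPLE =
        "<table:table xmlns:table='" + TABLE_NS + "' xmlns:text='" + TEXT_NS + "'"
        + " table:name='Sightings'>"
        + "<table:table-column table:style-name='Sightings.A'/>"
        + "<table:table-column table:style-name='Sightings.B'/>"
        + "<table:table-row>"
        + "<table:table-cell><text:p>January</text:p></table:table-cell>"
        + "<table:table-cell><text:p>3</text:p></table:table-cell>"
        + "</table:table-row>"
        + "<table:table-row>"
        + "<table:table-cell><text:p>February</text:p></table:table-cell>"
        + "<table:table-cell><text:p>5</text:p></table:table-cell>"
        + "</table:table-row>"
        + "</table:table>";

    /** Returns one list of cell strings per table:table-row, in document order. */
    static List<List<String>> readRows(String tableXml) {
        List<List<String>> rows = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tableXml)));
            NodeList rowNodes = doc.getElementsByTagNameNS(TABLE_NS, "table-row");
            for (int r = 0; r < rowNodes.getLength(); r++) {
                NodeList cellNodes = ((Element) rowNodes.item(r))
                        .getElementsByTagNameNS(TABLE_NS, "table-cell");
                List<String> cells = new ArrayList<>();
                for (int c = 0; c < cellNodes.getLength(); c++) {
                    cells.add(cellNodes.item(c).getTextContent()); // text of the nested text:p
                }
                rows.add(cells);
            }
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse table", ex);
        }
        return rows;
    }

    public static void main(String[] args) {
        readRows(SAMPLE).forEach(System.out::println);
    }
}
```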

Embedded Objects

The chart is considered to be an object; it is referenced by a <draw:object> element that has an xlink:href attribute that refers to the Object 1 subdirectory. That directory contains its own content.xml and styles.xml files.

The ODFDOM Toolkit

I could go on at great length about the XML that makes up an OpenDocument file, but let’s face it: nobody wants to work at the level of .zip files and raw XML. Instead, use the ODFDOM Toolkit, a set of Java classes that makes creating and modifying documents much easier.

The ODFDOM classes allow you to work with a document at three levels, as described on the project’s site:

The ODF Package / Physical Layer
Methods in this layer provide direct access to the resources stored in the ODF package, such as XML streams, images or embedded objects.
The ODF XML Layer: low-level DOM API
This layer provides a class for every ODF XML element defined by the ODF specification and its grammar (the RelaxNG schema). The classes are generated directly from the ODF grammar, thus guaranteeing complete and accurate coverage of the ODF specification.
The ODF XML Layer: high-level Document API
This layer provides a much higher-level view of the ODF schema features. It hides the ODF XML implementation details and covers frequent user scenarios.

What’s Next?

In the next article, we will use the ODFDOM classes to convert an XML data file of information about movies to an OpenDocument text (word processing) file.