Creating The Pauling Catalogue: Formatting Text with XML and XSL

The text formatting cycle used in the creation of The Pauling Catalogue

[Part 4 of 9] One of the earliest and most pressing questions that the project team had to answer in constructing The Pauling Catalogue was how to go about formatting the text of such a massive document. The catalogue had been generated over many years as a series of WordPerfect word processing documents. While the word processing interface worked nicely in developing working documents, moving the catalogue data out of WordPerfect and into a flexible format more suitable to a professional printing operation was a significant challenge.

Ultimately it was decided to format the text data using Extensible Markup Language (XML) and Extensible Stylesheet Language Transformations (XSLT).

XML is a markup language that adds machine-readable structure to existing data. Using a series of tags applied hierarchically throughout a given data set, XML greatly enhances one’s ability to manipulate data in useful, uniform ways. This manipulation of XML-encoded data is implemented using XSLT. In a nutshell, XSL transformations consist of sets of rules that locate specific pieces of data and then either order the data pieces in a certain prescribed way or hide the data pieces entirely.
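To make the "locate, order, or hide" idea concrete, here is a minimal sketch in Python rather than XSLT, using only the standard library. The element names (`book`, `callnumber`, `author`, `title`, `year`) and the sample values are invented for illustration and are not the catalogue's actual mark-up.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: tag names and values are placeholders,
# not the schema used for The Pauling Catalogue.
RECORD = """
<book>
  <callnumber>XX 000</callnumber>
  <author>Adams, Ann</author>
  <title>First Sample</title>
  <year>1950</year>
</book>
"""

def format_entry(xml_text, order=("author", "title", "year")):
    """Emit the fields named in `order`, in that order; all other fields are hidden."""
    book = ET.fromstring(xml_text)
    parts = []
    for tag in order:
        element = book.find(tag)  # locate a specific piece of data
        if element is not None and element.text:
            parts.append(element.text.strip())
    return ". ".join(parts) + "."

print(format_entry(RECORD))
```

Because the call number is not listed in `order`, it is left out of the output, while the remaining fields appear in the prescribed sequence. An XSLT stylesheet expresses the same kind of rule declaratively.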

The seventeen series that make up The Pauling Catalogue were each encoded in XML and manipulated – sometimes subtly and sometimes severely – using XSLT. The importance of this process to the creation of the end product is difficult to overstate. A perfect illustration of the power of XML and XSLT is provided by the Pauling Personal Library series in Volume 6 of The Pauling Catalogue.  Linus and Ava Helen Pauling’s personal library contains over 4,000 volumes and the published bibliography of all these items is 178 pages long. The XML mark-up for each book is shown here:

The XML encoding schema for one of the 4,000+ books in the Pauling Personal Library

When the personal library was originally encoded for display on the web, all of the volumes that make up the series were arranged according to Library of Congress classification number. As the details of The Pauling Catalogue publication were being determined, a decision was made that the books in the Personal Library would be more useful to users of a paper reference if each item were presented alphabetically by author’s last name.

Carrying out this re-sort process by hand would have taken a very long time, as each book listing would require human “cutting and pasting” intervention to reorganize the records from call number order to alphabetical order. However, because the content of the personal library had been described in XML, which is a machine-readable format, a series of new XSLT rules was instead used to automate the re-sort:

A series of rules written in XSL was used to re-sort the Pauling Personal Library arrangement.
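The rules themselves appear only as an image in the original post, so they are not reproduced here. As a rough stand-in, the sketch below performs the same kind of re-sort in Python with the standard library. The element names and the three sample records are invented for illustration; in XSLT, this sort of reordering is typically done with an `xsl:sort` instruction.

```python
import xml.etree.ElementTree as ET

# Hypothetical data, listed in call number order. Tag names and values
# are placeholders, not the catalogue's actual records.
LIBRARY = """
<library>
  <book><callnumber>B 100</callnumber><author>Young, C.</author><title>Third Sample</title></book>
  <book><callnumber>Q 200</callnumber><author>Adams, A.</author><title>First Sample</title></book>
  <book><callnumber>QD 300</callnumber><author>Miller, B.</author><title>Second Sample</title></book>
</library>
"""

def sort_by_author(xml_text):
    """Return the parsed library with its books reordered by author, then title."""
    root = ET.fromstring(xml_text)
    books = sorted(
        root.findall("book"),
        key=lambda b: (b.findtext("author", "").lower(),
                       b.findtext("title", "").lower()),
    )
    root[:] = books  # replace the children with the re-sorted list
    return root

for book in sort_by_author(LIBRARY):
    print(book.findtext("author"), "-", book.findtext("title"))
```

The point is the same as in the XSLT case: once each record carries its own labelled author field, reordering thousands of entries is a one-line sort key rather than a manual cut-and-paste job.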

Consequently, a process that would have taken many days, if not weeks, to conduct “by hand,” was instead completed with a few hours of nimble XSLT coding.  The resulting differences from the version 2 proof to the version 6 proof of The Pauling Catalogue are immediately apparent:

From working version 2 to working version 6 of the publication, significant arrangement changes were made to the Pauling Personal Library

Another major benefit of XML is the standard’s support for special characters.  When developing content in HTML, web authors have traditionally been required to describe special characters (e.g. scientific symbols or non-Roman alphabetic characters) using character entities.

For example, if one wished to insert a subscript number 2 into their text, HTML would require that the author use the character entity &#8322; to display the symbol in a web browser. XML, on the other hand, uses tags that are both human- and machine-readable to describe and format a subscript 2 (see illustration below).

The situation is similar for symbols such as an arrow: HTML requires the character entity &#8594; while XML “understands” and will output an arrow symbol entered into a well-formed XML document. This enhanced support for special character encoding was terrifically helpful in the formatting of the Pauling Research Notebooks series, which contains a great number and variety of special characters:

An example of the special characters mark-up used in the Pauling Research Notebooks series. XML's support of special characters encoding is significantly more intuitive and elegant than are the character entity requirements specified by HTML.
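Since the illustration is an image, here is a small, hedged sketch of the general idea in Python with the standard library. The `entry` and `sub` tag names are assumptions made for this example and may not match the mark-up actually used in the Research Notebooks series. The arrow is typed directly into the XML, and the subscript is described with a tag that a formatting rule later converts for display.

```python
import xml.etree.ElementTree as ET

# Maps ordinary digits to their Unicode subscript forms.
SUBSCRIPT_DIGITS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

# Hypothetical mark-up: <entry> and <sub> are illustrative tag names only.
NOTE = "<entry>2H<sub>2</sub> + O<sub>2</sub> → 2H<sub>2</sub>O</entry>"

def render(xml_text):
    """Flatten an entry to plain text, turning <sub> content into subscript digits."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        text = child.text or ""
        if child.tag == "sub":
            text = text.translate(SUBSCRIPT_DIGITS)
        parts.append(text)
        parts.append(child.tail or "")  # text that follows the closing tag
    return "".join(parts)

print(render(NOTE))
```

The arrow passes through the parser untouched, and the meaning of each `sub` tag stays readable in the source, which is the convenience the paragraph above describes.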

The Pauling Catalogue

While XML and XSLT provided a strong platform for the formatting of The Pauling Catalogue text, the 1,200+ illustrations inserted throughout the six-volume publication presented a new and varied set of challenges.  The processes required to cope with these issues will be the subject of our next post in this series.

The Pauling Catalogue is available for purchase at http://paulingcatalogue.org
