[Part 4 of 9] One of the earliest and most pressing questions that the project team had to answer in constructing The Pauling Catalogue was how to go about formatting the text of such a massive document. The catalogue had been generated over many years as a series of WordPerfect word processing documents. While the word processing interface worked nicely in developing working documents, moving the catalogue data out of WordPerfect and into a flexible format more suitable to a professional printing operation was a significant challenge.
Ultimately it was decided to format the text data using Extensible Markup Language (XML) and Extensible Stylesheet Language Transformations (XSLT).
XML is a markup scheme that adds machine-readable structure to existing data. Using a series of tags applied hierarchically throughout a given data set, XML greatly enhances one’s ability to manipulate data in useful, uniform ways. This manipulation of XML-encoded data is implemented using XSLT. In a nutshell, XSL transformations consist of sets of rules that locate specific pieces of data and then either arrange those pieces in a prescribed order or hide them entirely.
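To make the "locate, then show or hide" idea concrete, here is a minimal sketch in Python's standard library rather than XSLT. The element names (`library`, `book`, `author`, `title`, `note`) and all of the values are invented for illustration and are not the project's actual mark-up.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue fragment: element names and values are invented
# for illustration and are not the project's actual mark-up.
SAMPLE = """\
<library>
  <book>
    <author>Adams, Ann</author>
    <title>First Example Title</title>
    <note>internal working note</note>
  </book>
  <book>
    <author>Baker, Ben</author>
    <title>Second Example Title</title>
    <note>another working note</note>
  </book>
</library>
"""


def select_for_print(xml_text):
    """Locate the author and title of each record; the note is left out."""
    root = ET.fromstring(xml_text)
    lines = []
    for book in root.iter("book"):
        author = book.findtext("author", default="")
        title = book.findtext("title", default="")
        lines.append(f"{author}: {title}")
    return lines


if __name__ == "__main__":
    for line in select_for_print(SAMPLE):
        print(line)
```

Because every piece of data sits inside a named tag, the rule "keep author and title, drop the note" can be stated once and applied uniformly to every record.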
The seventeen series that make up The Pauling Catalogue were each encoded in XML and manipulated – sometimes subtly and sometimes severely – using XSLT. The importance of this process to the creation of the end product is difficult to overstate. A perfect illustration of the power of XML and XSLT is provided by the Pauling Personal Library series in Volume 6 of The Pauling Catalogue. Linus and Ava Helen Pauling’s personal library contains over 4,000 volumes and the published bibliography of all these items is 178 pages long. The XML mark-up for each book is shown here:
When the personal library was originally encoded for display on the web, all of the volumes that make up the series were arranged according to Library of Congress classification number. As the details of The Pauling Catalogue publication were being determined, a decision was made that the books in the Personal Library would be more useful to users of a paper reference if each item were presented alphabetically by author’s last name.
Carrying out this re-sort by hand would have taken a very long time, as each book listing would have required human “cutting and pasting” to move the records from call number order to alphabetical order. However, because the content of the personal library had been described in XML, which is a machine-readable format, a series of new XSLT rules was used instead to automate the re-sort:
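As a rough illustration of the same idea (not the project's actual XSLT), the sketch below re-sorts a few XML records by author in Python. The element names and sample records are hypothetical. In an XSLT stylesheet, the equivalent step would typically be an `xsl:sort` instruction keyed on the author element.

```python
import xml.etree.ElementTree as ET

# Hypothetical records listed in call-number order; names and values are
# placeholders, not entries from the actual Personal Library series.
CATALOGUE = """\
<library>
  <book><callnumber>A 001</callnumber><author>Young, Yara</author><title>Gamma</title></book>
  <book><callnumber>B 002</callnumber><author>Adams, Ann</author><title>Beta</title></book>
  <book><callnumber>C 003</callnumber><author>adams, ann</author><title>Alpha</title></book>
  <book><callnumber>D 004</callnumber><author>Miller, Max</author><title>Delta</title></book>
</library>
"""


def author_sort_key(book):
    """Sort on author (stored as 'Last, First'), then title, ignoring case."""
    author = book.findtext("author", default="")
    title = book.findtext("title", default="")
    return (author.casefold(), title.casefold())


def sort_by_author(xml_text):
    """Return the parsed tree with its <book> children in author order."""
    root = ET.fromstring(xml_text)
    # Slice assignment replaces the children with the same elements, re-ordered.
    root[:] = sorted(root.findall("book"), key=author_sort_key)
    return root


if __name__ == "__main__":
    for book in sort_by_author(CATALOGUE):
        print(book.findtext("author"), "|", book.findtext("title"))
```

The sort key is written once and applied to every record, which is why a change of ordering that would take days by hand becomes a small code change.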
Consequently, a process that would have taken many days, if not weeks, to conduct “by hand,” was instead completed with a few hours of nimble XSLT coding. The resulting differences from the version 2 proof to the version 6 proof of The Pauling Catalogue are immediately apparent:
Another major benefit of XML is the standard’s support for special characters. When developing content in HTML, web authors have traditionally been required to describe special characters (e.g. scientific symbols or non-Roman alphabetic characters) using character entities.
For example, if one wished to insert a subscript number 2 into one’s text, HTML would require that the author use the character entity &#8322; to display the symbol in a web browser. XML, on the other hand, uses tags that are both human- and machine-readable to describe and format a subscript 2. (See illustration below.)
The situation is similar for symbols such as an arrow: HTML requires the character entity &#8594; while XML “understands” and will output an arrow symbol entered into a well-formed XML document. This enhanced support for special-character encoding was terrifically helpful in the formatting of the Pauling Research Notebooks series, which contains a great number and variety of special characters:
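To illustrate the character-handling point, here is a minimal Python sketch. It assumes a hypothetical `entry` element and a `sub` tag for subscripts, neither of which is confirmed as the project's actual mark-up. It shows that an XML parser accepts an arrow typed directly into the document and treats it the same as the numeric form, and it decodes the two numeric references mentioned above.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical notebook-style fragments; <entry> and <sub> are assumed names.
LITERAL = "<entry>CO<sub>2</sub> \u2192 products</entry>"   # arrow typed directly
ESCAPED = "<entry>CO<sub>2</sub> &#8594; products</entry>"  # numeric reference


def flatten(xml_text):
    """Parse the fragment and join all text content, dropping the tags."""
    return "".join(ET.fromstring(xml_text).itertext())


def subscript_values(xml_text):
    """Return the text of every <sub> element, i.e. the tagged subscripts."""
    return [sub.text for sub in ET.fromstring(xml_text).iter("sub")]


def decode_reference(reference):
    """Turn a numeric character reference into the character it names."""
    return html.unescape(reference)


if __name__ == "__main__":
    print(flatten(LITERAL))
    print(flatten(ESCAPED))
    print(subscript_values(LITERAL))
    print(decode_reference("&#8322;"), decode_reference("&#8594;"))
```

Tagging the subscript keeps the plain digit in the data and records its formatting as structure, so a later transformation can decide how to render it.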
While XML and XSLT provided a strong platform for the formatting of The Pauling Catalogue text, the 1,200+ illustrations inserted throughout the six-volume publication presented a new and varied set of challenges. The processes required to cope with these issues will be the subject of our next post in this series.
The Pauling Catalogue is available for purchase at http://paulingcatalogue.org