Any given page in the Linus Pauling Day-by-Day calendar is the product of up to four different XML records. These records describe the various bits of data that comprise the project – be they document summaries, images or full-text transcripts. The data contained in the various XML records are then interpreted by XSL stylesheets, which redistribute the information and generate local HTML files as their output. The local HTML is, in turn, styled using CSS and then uploaded to the web.
Got all that?
In reality, the process is not quite as complicated as it may seem. Today’s post is devoted to describing the four XML components that serve as building blocks for the calendar project. Later this week, we’ll talk more about the XSL side of things. (For some introductory information on XML and XSL, see this post, which discusses our use of these tools in creating The Pauling Catalogue.)
Preliminary work in WordPerfect
The 68,000+ document summaries that comprise the meat of Linus Pauling Day-by-Day have been compiled by dozens of student assistants over the past ten years. Typically, each student has been assigned a small portion of the collection and is charged with summarizing, in two or three sentences, each document contained in their assigned area. These summaries have, to date, been written in a series of WordPerfect documents.
The January 30, 2009 launch of Linus Pauling Day-by-Day is being referred to, internally, as Calendar 1.5. This is in part because of several major workflow changes that we have on tap for future calendar work, chief among them a move away from WordPerfect. While the word processing approach has worked pretty well for our students – it’s an interface with which they’re familiar, and includes all the usual advantages of a word processing application (spellchecking, etc.) – it does present fairly substantial complications for later stages of the workflow.
For one, everything has to be exported out of the word processing documents and marked up in XML by hand. For another, the movement out of a word processor and into XML often carries with it issues related to special characters, especially “&,” “smart quotes” and “em dash,” all of which can play havoc with certain XML applications.
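To make the special-character problem concrete, here is a minimal Python sketch of one way such text could be cleaned before it is wrapped in XML. It is not the project's actual tooling: the `clean_for_xml` helper, the character mapping and the sample sentence are all invented for illustration, and whether to flatten typographic punctuation or keep it as Unicode is a judgment call.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping of word-processor punctuation to plain ASCII.
# Flattening is only one option; keeping the Unicode characters is another.
TYPOGRAPHIC = {
    "\u201c": '"',   # left double smart quote
    "\u201d": '"',   # right double smart quote
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote / apostrophe
    "\u2014": "--",  # em dash
}

def clean_for_xml(text: str) -> str:
    """Flatten typographic punctuation, then escape XML-reserved characters."""
    for fancy, plain in TYPOGRAPHIC.items():
        text = text.replace(fancy, plain)
    # escape() converts &, < and > to &amp;, &lt; and &gt;
    return escape(text)

# Made-up sample sentence, not a real Day-by-Day summary.
summary = "Pauling \u201cwrites\u201d to Weaver \u2014 re: grants & travel."
print(clean_for_xml(summary))
```

Run as written, this prints `Pauling "writes" to Weaver -- re: grants &amp; travel.`, which any XML parser will accept.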
Our plan for Calendar 2.0 is to move out of a word processing interface for the initial data entry in favor of an XForms interface, but that’s fodder for a later post.
The “Year XML”
Once a complete set of data has been compiled in WordPerfect, the content is then moved into XML. All of the event summaries that our students write are contained in what might be called “Year XML” records. An example of the types of data that are contained in these XML files is illustrated here in fig. 2. Note that the information in the fig. 2 slide is truncated for display purposes – all of the “---” markers represent text that has been removed for the sake of scaling the illustration – but that generally speaking, the slide refers to the contents of the January 2, 1940 and August 7, 1940 Day-by-Day pages, the latter of which will also serve as our running example in the illustrations that follow.
Cursory inspection of the “Year XML” slide reveals one of the mark-up language’s key strengths – its simplicity. For the most part, all of the tags used are easily understandable to humans and the tag hierarchy that organizes the information follows a rather elementary logic. The type of record is identified using <calendar>, months are tagged as <month>, days are tagged as <day> and document summaries are tagged as <event>.
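As a rough illustration of how little code is needed to walk that hierarchy, here is a short Python sketch. Only the four tag names come from the description above; the `year`, `name` and `date` attributes, the sample text and the `events_by_day` helper are assumptions made for the example, and the real records may well be organized differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the four tag names come from the post, but the
# "year", "name" and "date" attributes and the event text are invented.
SAMPLE = """
<calendar year="1940">
  <month name="January">
    <day date="2">
      <event>Sample summary of a letter.</event>
      <event>Sample summary of a manuscript.</event>
    </day>
  </month>
  <month name="August">
    <day date="7">
      <event>Sample summary of another letter.</event>
    </day>
  </month>
</calendar>
"""

def events_by_day(xml_text: str) -> dict:
    """Return {(month, day): [event text, ...]} by walking the tag hierarchy."""
    root = ET.fromstring(xml_text)
    result = {}
    for month in root.findall("month"):
        for day in month.findall("day"):
            key = (month.get("name"), day.get("date"))
            result[key] = [event.text for event in day.findall("event")]
    return result

for (month, day), events in events_by_day(SAMPLE).items():
    print(month, day, len(events))
```

Because each level nests directly inside the one above it, two plain loops are enough to reach every event.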
The one semi-befuddling aspect of the “Year XML” syntax is the i.d. system used in reference to illustrations and transcripts. After much experimentation, we have developed an internal naming system that works pretty well in assigning unique identifiers to every item in the Pauling archive. The system is primarily based upon a document’s Pauling Catalogue series location and folder identifier, although since the Catalogue is not an item-level inventory (not completely, anyway) many items require further description in their identifier. In the most common case of letters, the further description includes identifying the correspondents and including a date.
Fig. 2 provides an example of three identifiers. The first is <record><id series="09">1940i.38</id></record>, which is the “Snapshot” reference for the 1940 index page. This identifier is relatively simple as it defines a photograph contained in the Pauling Catalogue Photographs and Images series (series 09), the entirety of which is described on the item level. So this XML identifier utilizes only a series notation (“09”) and a Pauling Catalogue notation (1940i.38).
The two other examples in Fig. 2 are both letters. The first is <record><id series="01">corr136.9-lp-giguere-19400102</id></record>, a letter from Linus Pauling to Paul Giguere located in the Pauling Catalogue Correspondence series (series 01) in folder 136.9, and used on Day-by-Day as the illustration for the first week of January 1940. Because the folder is not further described on the item level, there exists a need for more explication in the identifier of this digital object. Hence the listing of the correspondents involved and the date on which the letter was written.
The second example is a similar case: <record><id series="11">sci14.038.9-lp-weaver-19400807</id></record>, used as the Day-by-Day illustration for the first full week of August 1940. In this instance, however, the original document is held in Pauling Catalogue series 11, Science, and is a letter written by Pauling to Warren Weaver on August 7, 1940.
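For readers who think in code, the identifier convention can be sketched as a small Python parser. The pattern below is inferred only from the three identifiers quoted above, so treat it as an assumption and not as the project's real rule set; in particular, reading the two name segments as sender and recipient is a guess based on the two letters, and `ObjectId` and `parse_object_id` are hypothetical names.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

class ObjectId(NamedTuple):
    location: str              # Catalogue location, e.g. "corr136.9"
    sender: Optional[str]      # e.g. "lp" (inferred: first correspondent)
    recipient: Optional[str]   # e.g. "giguere" (inferred: second correspondent)
    written: Optional[date]    # parsed from the trailing YYYYMMDD

# Assumed pattern, inferred only from the three identifiers quoted above.
ID_PATTERN = re.compile(
    r"^(?P<location>[^-]+)"
    r"(?:-(?P<sender>[a-z]+)-(?P<recipient>[a-z]+)-(?P<ymd>\d{8}))?$"
)

def parse_object_id(identifier: str) -> ObjectId:
    """Split an identifier into its location and optional letter details."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    ymd = match.group("ymd")
    written = date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:])) if ymd else None
    return ObjectId(match.group("location"), match.group("sender"),
                    match.group("recipient"), written)

for raw in ("1940i.38", "corr136.9-lp-giguere-19400102",
            "sci14.038.9-lp-weaver-19400807"):
    print(parse_object_id(raw))
```

The photograph identifier yields only a location, while the two letters also yield correspondents and a date, mirroring the distinction between item-level and folder-level description.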
METS Records to Power the Illustrations
We’ve talked about METS records a few times in the past, and have defined them as “all-in-one containers for digital objects.” The Pauling to Weaver illustration mentioned above is a good example of this crucial piece of functionality, in that it is used as a week illustration in the August 1940 component of the Day-by-Day project, and is also a supporting document on the “It’s in the Blood!” documentary history website. Despite its dual use, the original document was only ever scanned once and described once in METS and MODS. Once an item is properly encoded in a METS record, it becomes instantly available for repurposing throughout our web presence.
Just about everything that we need to know about a scanned document is contained in its METS record. In the case of Day-by-Day, we can see how various components of the Pauling to Weaver METS record are extracted to display on two different pages of the project. Fig. 3 is a screenshot of this page, the “Week Index View” for the August 7, 1940 Day-by-Day page (all of the days for this given week will display the same illustration, but will obviously feature different events and transcripts listings). Fig. 4 is a screenshot of the “Full Illustration View,” wherein the user has clicked on the Week Index illustration and gained access to both pages of the letter, as well as a more detailed description of its contents.
Below (fig. 5) is an annotated version of the full METS record for the Pauling to Weaver letter. As you’ll note once you click on it, fig. 5 is huge, but it’s worth a look in that, among other details, it gives an indication of how different components of the record are distributed to different pages. For example:
- The Object Title, “Letter from Linus Pauling to Warren Weaver,” which displays in both views.
- The Object Summary, “Pauling requests that Max Perutz…,” which displays only in the Full Illustration View.
- The Object Date, used in both views.
- The Local Object Identifier, sci14.038.9-lp-weaver-19400807, which displays at the bottom of the Full Illustration View.
- The Page Count, used only in the Week Index View.
- Crucially, the 400 pixel-width jpeg images, which are stored in one location on our web server (corresponding, again, with Pauling Catalogue series location), but in this example retrieved for display only in the Week Index View.
- And likewise, the 600 pixel-width jpeg images, which are retrieved for Day-by-Day display in the Full Illustration View, but also used for reference display in the Documentary History projects.
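The list above amounts to a small lookup table of which piece of the record feeds which view. Here is one way to picture it in Python; the key names are hypothetical labels chosen for this sketch, not actual METS or MODS element names, and `fields_for` is an invented helper.

```python
# Which parts of a METS-derived record appear in which Day-by-Day view,
# following the list above. Key names are hypothetical labels, not METS elements.
FIELD_VIEWS = {
    "title":        {"week_index", "full_illustration"},
    "summary":      {"full_illustration"},
    "date":         {"week_index", "full_illustration"},
    "local_id":     {"full_illustration"},
    "page_count":   {"week_index"},
    "images_400px": {"week_index"},
    "images_600px": {"full_illustration"},
}

def fields_for(view: str) -> list:
    """Return the field labels a given view draws on, in listing order."""
    return [name for name, views in FIELD_VIEWS.items() if view in views]

print(fields_for("week_index"))
print(fields_for("full_illustration"))
```

The point of the sketch is simply that one description of an object can serve several pages, each taking only the fields it needs.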
An additional word about the illustrations used in Linus Pauling Day-by-Day
One of the major new components of the “1.5 Calendar” launch is full-page support for illustrations of ten pages or less – in the 1.0 version of the project, only the first page of each illustration was displayed, no matter the length of the original document. Obviously this is a huge upgrade in the amount and quality of the content that we are able to provide from within the calendar. The question begs to be asked, however, “why ten pages or less?”
In truth, the ten pages rule is somewhat arbitrary, but it works pretty well in coping with a major scaling problem that we face with the Day-by-Day project. Users will note that the “Full Illustration View” for all Day-by-Day objects presents the full page content (when available) on a single HTML page, as opposed to the cleaner interface used on our Documentary History sites. There’s a good reason for this. In the instance of the Documentary History interface, essentially two HTML pages are generated for every original page of a document used as an illustration: a reference view and a large view. This approach works well for the Documentary History application, in that even very large objects, such as Pauling’s 199-page Berkeley Lectures manuscript, can be placed on the web without the size of a project exploding out of control – the Berkeley Lectures comprise 398 HTML pages, which is a lot, but still doable.
Linus Pauling Day-by-Day, on the other hand, currently requires that the full set of images making up an illustration be generated separately for each day of the week in which that illustration appears. In other words, if the Berkeley Lectures were chosen to illustrate a week within the calendar, and the full content of the digital object were to be displayed for each day of that week using the same clean interface as a Documentary History, a sum total of 2,786 (199 x 2 x 7) HTML pages would need to be generated to accomplish the mission. For that week only. Obviously this is not a sustainable proposition. By contrast, the current version 1.5 approach always requires 7 HTML pages for each week, though full image support and super-clean display are sometimes sacrificed in the process.
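The arithmetic behind that comparison is easy to check. The following Python sketch restates the two page counts described above; the function and constant names are ours, introduced only for illustration.

```python
DAYS_PER_WEEK = 7
VIEWS_PER_PAGE = 2  # Documentary History style: a reference view and a large view

def documentary_style_pages(page_count: int) -> int:
    """HTML pages for one week if every day repeated the two-view interface."""
    return page_count * VIEWS_PER_PAGE * DAYS_PER_WEEK

def calendar_1_5_pages() -> int:
    """HTML pages for one week under the single-page-per-day approach."""
    return DAYS_PER_WEEK

print(documentary_style_pages(199))  # 2786
print(calendar_1_5_pages())          # 7
```

The first figure grows with the length of the document, while the second stays fixed at seven no matter how long the illustration is.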
Calendar 2.0 will deal with the issue using a database approach, but again, this is a different topic for a different time.
Last but not least, TEI Lite
We’ve discussed TEI Lite in the past as well and will not spend a great deal of time with it here, except to reiterate that it is a simple mark-up language that works well in styling full-text transcripts and other similar documents for the web.
There are nearly 2,000 TEI Lite documents included in Linus Pauling Day-by-Day, virtually all of them transcripts of letters sent or received by Linus Pauling. Transcript references within the Year XML are illustrated in fig. 2 above – they follow the exact same naming convention as our METS records, except that the mets.xml suffix is replaced by tei.xml. It is worth noting that rough drafts of most of the text that was eventually encoded in TEI for the Day-by-Day project were generated using OCR software. And while OCRing has improved mightily over the years, it still does have its quirks, which is why some of you might find, for example, the number 1 substituted for the lower-case letter l in a few of the transcripts currently online. (We’re working on it.)
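As an aside on that OCR quirk, a simple heuristic can surface likely l/1 mix-ups for a human to review. The Python sketch below is our own illustration and not a description of how the transcripts are actually being corrected; it flags any word in which the digit 1 sits directly beside a lower-case letter, so it will also catch legitimate tokens such as "1st".

```python
import re

# Heuristic: a word in which the digit 1 touches a lower-case letter.
# This is a rough filter for human review, not a reliable detector.
SUSPECT = re.compile(r"\b\w*(?:[a-z]1|1[a-z])\w*\b")

def suspect_words(text: str) -> list:
    """Return words that may contain a digit 1 misread for the letter l."""
    return SUSPECT.findall(text)

# Made-up sample line, not taken from a real transcript.
print(suspect_words("He1lo wor1d, written on 1 May 1940"))
```

Standalone numbers and dates are left alone because no letter touches the digit.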
The TEI Lite mark-up for the Pauling to Weaver letter is illustrated in fig. 6, as is, in fig. 7, the encoding for the biographical chronology (written by Dr. Robert Paradowski) used on the 1940 index page. Note, in particular, the use of <div> tags to declare a year’s worth of information in the Paradowski mark-up. These tags were included as markers for the XSL stylesheet to pull out a given chunk of data to be placed on a given year’s index page. The entire Paradowski chronology will be going online soon, and once again that project, as with Day-by-Day, will be generated from only this single record.
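The idea of using <div> tags as year markers can be pictured with a few lines of Python. In the real workflow this extraction is done by an XSL stylesheet; the sketch below only mimics the principle. TEI Lite does provide `type` and `n` attributes on <div>, but marking a year as `n="1940"` is an assumption here, and the paragraph text and the `year_paragraphs` helper are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical chronology fragment. Using n="YYYY" to mark a year is an
# assumption, and the paragraph text is placeholder content.
CHRONOLOGY = """
<body>
  <div type="year" n="1939"><p>Placeholder entry for 1939.</p></div>
  <div type="year" n="1940"><p>Placeholder entry for 1940.</p></div>
</body>
"""

def year_paragraphs(tei_text: str, year: int) -> list:
    """Return the paragraph text inside the <div> marked for the given year."""
    root = ET.fromstring(tei_text)
    div = root.find(f".//div[@n='{year}']")
    if div is None:
        return []
    return [p.text for p in div.findall("p")]

print(year_paragraphs(CHRONOLOGY, 1940))
```

One source document can thus feed many index pages, each asking only for its own year.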
Custom XML, METS records and TEI Lite – these are the building blocks of Linus Pauling Day-by-Day. Check back later this week when we’ll discuss the means by which the blocks are assembled into a finished website using XSL stylesheets.