New Image Search and Catalogue Pages

Continuing the theme from our last post, Redesigning our Web Presence, here is a closer look at how we built the new Image Search feature as well as what it takes to create the catalogue pages showing the detailed holdings of our collections.

New Image Search Feature

Our main web search feature, now on all of our pages, is provided by Oregon State’s campus search engine running the Nutch software. This software works similarly to Google: web crawlers find web pages and index them for searching. Each Documentary History website and Linus Pauling Day-by-Day has a search box that limits results to the pages and items on that site alone.

However, the main search feature on the OSU Libraries homepage does not use the campus engine, and our websites and digital objects were not included in its results. The Library’s search feature is powered by LibraryFind, built by a team at OSU using the Ruby on Rails web application framework. LibraryFind harvests and indexes many records and data sources, but does not crawl web pages, so we needed to get our records harvested by and indexed into LibraryFind.

Since our digital object records are stored using the METS format with MODS metadata, it was fairly easy to convert this to Dublin Core metadata, which can then be served up by an OAI-PMH provider. LibraryFind then checks our OAI-PMH provider and harvests our digital object metadata for indexing into LibraryFind. Thanks to Terry Reese for his assistance on setting up the provider software and for his work on LibraryFind.
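The crosswalk itself is straightforward. As an illustration (not our production code, which is done in XSLT), here is a minimal Python sketch that maps a few core MODS fields onto their Dublin Core equivalents; the sample record and the field selection are simplified for brevity:

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"
DC = "http://purl.org/dc/elements/1.1/"

# A tiny MODS record, trimmed to the fields the crosswalk touches.
mods_record = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Letter from Linus Pauling to Warren Weaver</title></titleInfo>
  <name><namePart>Pauling, Linus</namePart></name>
  <originInfo><dateCreated>1940-08-07</dateCreated></originInfo>
</mods>
"""

def mods_to_dc(mods_xml):
    """Map a few core MODS fields onto their Dublin Core equivalents."""
    root = ET.fromstring(mods_xml)
    ns = {"m": MODS}
    dc = ET.Element("record")
    mapping = [
        ("m:titleInfo/m:title", "title"),
        ("m:name/m:namePart", "creator"),
        ("m:originInfo/m:dateCreated", "date"),
    ]
    for xpath, dc_name in mapping:
        for el in root.findall(xpath, ns):
            child = ET.SubElement(dc, "{%s}%s" % (DC, dc_name))
            child.text = el.text
    return dc

record = mods_to_dc(mods_record)
for el in record:
    print(el.tag.split("}")[1] + ": " + el.text)
```

The resulting Dublin Core elements are what the OAI-PMH provider serves up for LibraryFind to harvest.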

One piece of metadata that our METS records lacked was a URL for where each digital object sits once it’s part of our Documentary History websites or Linus Pauling Day-by-Day. Our own scripts and stylesheets didn’t need this information, since they built the pages, but LibraryFind needed the URL to link to from search results that include our digital objects. This required adding:

<mods:location><mods:url access="object in context" displayLabel="Linus Pauling and the Race for DNA: A Documentary History. Pictures and Illustrations.">
http://scarc.library.oregonstate.edu/coll/pauling/dna/pictures/1948i.61.html
</mods:url></mods:location>

as appropriate to our metadata files.

Now that our digital objects are harvested into LibraryFind, they are included in results for searches done from the Library’s homepage. We can also limit the search results to just our materials, and by turning on the image results view, we get our new Image Search functionality. Now it is easy to search for digitized materials that are online, including photographs, scans of documents, and more.

Image Search results, powered by LibraryFind


In the future, we plan to include more of our digital objects and increase support for complex items and multimedia files.

Building Catalogue Pages from EAD

Each of our collections is stored in an Encoded Archival Description (EAD) XML file, which includes collection information as well as, for most collections, detailed catalogue listings. Often these descriptions go down to the item level, and there are varying levels and hierarchies in use, depending on what was appropriate for the material. The EAD files are processed by XSLT stylesheet files, similar to the rest of our website, and ideally we’d like to have a single set of files that can handle the different description levels and that we don’t have to tweak for specific collections. Here is a snippet of EAD XML for the Pauling Photographs series:

<c02 level="file">
<did>
<container type="box">1933i</container>
<unittitle>Photographs and Images related to Linus Pauling, <unitdate>1933</unitdate>.</unittitle>
</did>

<c03 level="item">
<did>
<container type="folder">1933i.1</container> 
<unitid audience="internal">1311</unitid>
<unittitle>Linus Pauling at OSU (Oregon Agricultural College) to receive an honorary doctorate of science. Pictured from left are Dr. Marvin Gordon Neale, president of the University of Idaho, David C. Henny, Linus Pauling, Chancellor W. J. Kerr, and Charles A. Howard, state superintendent of public instruction in Oregon. June 5, 1933. "LP at OSU (OAC) honorary doctorate 1933" Photographer unknown. Black and white print courtesy of the Oregon State University Archives.</unittitle>
<unitdate>1933</unitdate>
</did>
</c03>

<c03 level="item">
<did>
<container type="folder">1933i.2</container>
<unittitle>D.C. Henny, C.A. Howard and Linus Pauling. Linus Pauling receiving an honorary degree from Oregon Agricultural College. Print courtesy of the Oregon State University Archives. Photographer unknown. Black and white print.</unittitle>
<unitdate>1933</unitdate>
</did>
</c03>

Our previous setup for processing catalogues was heavily modified from older EAD Cookbook stylesheet files and contained a large amount of custom code. One of the major sections we added was code that split very large sections and box listings into separate web pages of a reasonable size, instead of presenting hundreds of boxes or folders on each page. To accomplish this, the stylesheet first built a temporary file containing a high-level ‘menu’ of the catalogue, with an ID range for each page to be created. The rest of the stylesheet then used this menu file to determine what pages to create, searching over the whole collection for the IDs at the start and end of each page. This was complicated by the fact that some box IDs are not easy to compare numerically, such as ‘1954h3.4’ or ‘NDi.2’. Usually the alphabetic characters were replaced with numbers or punctuation to make the IDs comparable against each other. This technique was not very efficient for large sections, and required lots of tweaking to handle all the various box and folder IDs we have.

Also, separate stylesheet files had to be created to better handle the Pauling Papers since it was so much larger than anything else, which meant that it was a pain to maintain features in both sets of stylesheet files. For the newer set of catalogue XSLT stylesheet files, we took a few different approaches.

First, a lot of redundant code was eliminated through the use of more XSLT matching templates. These allow you to write one set of formatting code and reuse it whenever an element appears. This made it easier to work around EAD’s flexibility for container lists, so a series or box would get processed the same no matter where it was in the encoded hierarchy. Here are some examples of the EAD matching templates:

<xsl:template match="ead:unittitle">
<xsl:apply-templates/>
<xsl:if test="not(ends-with(., '.'))">.</xsl:if>
<xsl:text> </xsl:text>
<xsl:if test="following-sibling::ead:unittitle"><br/></xsl:if>
</xsl:template>

<xsl:template match="ead:unitid[@type]">
<xsl:choose>
<xsl:when test="@type = 'isbn'">
  <a class="nowrap" href="http://www.worldcat.org/search?q=isbn%3A{replace(., '-', '')}" title="Look up this title in WorldCat"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'oclc'">
  <a class="nowrap" href="http://www.worldcat.org/search?q=no%3A{.}" title="Look up this title in WorldCat"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'lcc'">
  <a class="nowrap" href="http://oasis.oregonstate.edu/search/c?SEARCH={replace(., ' ', '+')}" title="Look up this title in the OSU Libraries' catalog"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'gdoc'">
  <a class="nowrap" href="http://oasis.oregonstate.edu/search/?searchtype=g&amp;searcharg={replace(replace(., ' ', '+'), '/', '%2F')}" title="Look up this title in the OSU Libraries' catalog"><xsl:value-of select="."/></a></xsl:when>
<xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
</xsl:choose>
</xsl:template>

Second, the code to break long sections into smaller pages for web display was redone, this time using the position of a box or item instead of the IDs. IDs are still used for display purposes, but the output is all based on position (such as 1-10, 11-20, etc.). This code is much cleaner since IDs are no longer directly compared. It’s also faster since it deals with the pages sequentially and doesn’t loop over the whole section every time a page is processed.
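A rough sketch of the position-based approach, in Python rather than XSLT, with hypothetical folder IDs standing in for a real collection:

```python
# Hypothetical folder IDs like the ones in our EAD; note that some of
# them can't be compared numerically ('1954h3.4', 'NDi.2'), which is
# why the new code pages by position instead of by ID.
ids = ["1933i.1", "1933i.2", "1954h3.4", "NDi.1", "NDi.2", "1940s.7", "1941s.2"]

def paginate(items, per_page):
    """Split items into fixed-size pages; IDs are only used for labels."""
    pages = []
    for start in range(0, len(items), per_page):
        chunk = items[start:start + per_page]
        pages.append({"label": chunk[0] + " - " + chunk[-1], "items": chunk})
    return pages

for page in paginate(ids, 3):
    print(page["label"], len(page["items"]))
```

Each page is built from a simple slice of the sequence, so no pass over the whole section is needed per page, and the IDs never have to be ordered or compared.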

Example of new catalogue page of Pauling Correspondence


Third, instead of using HTML tables for the columns layout of catalogue pages, we switched to a CSS-based layout that approximates the look of columns and indents. This requires much less code in both the XSLT for processing and the output files.

Finally, all catalogues are processed by the same set of files, and separate ones for the Pauling Papers are no longer needed. This will enable us to make improvements faster, expanding our links to digital content and increasing the access options for our materials.


The Building Blocks of Linus Pauling Day-by-Day, Part 2: XSLT

In our last post we discussed the four XML pieces that form the content of Linus Pauling Day-by-Day. Once properly formatted, these disparate elements are then combined into a cohesive, web-ready whole using a set of rules called Extensible Stylesheet Language Transformations (XSLT). This post will delve more deeply into XSLT and how we use it on the Linus Pauling Day-by-Day site.

Creating the Various HTML Pages

A single XSLT file controls the majority of the actions and templates that make up Linus Pauling Day-by-Day. Several Windows batch files, one for each year in the Calendar, control the calling of the calendar stylesheet with the calendar year XML. After some initial processing, the stylesheet begins to output all of the various pages. First, the main Calendar Homepage is created. This can be redundant when processing multiple years in a row, but ensures that the statistics and links are always up-to-date. Second, the Year Page is created. This pulls in extra data from the Pauling Chronology TEI file, specific to the year, displays a summary of the Paulings’ travel for that year, provides a snapshot image for the year, and lastly presents the activities for the year that don’t have a more specific date. An additional page is created that provides a larger image and some metadata for the snapshot document.

Then, acting on the months present in the data file, a Month Page is created. This presents the first calendar grid (how this is done is explained in the next section) followed by all of the days in that month that have activities and document data. If there is data about where the Paulings traveled, it is displayed in the calendar grid to give a visual overview of their itineraries. Finally, a Day Page is created for each day present within a month. The stylesheet simply acts on the days listed in the calendar XML file, and does not try to figure out how many days are in a given month in a given year. This style of acting only on the data provided, instead of always working over a fixed size or range, is part of the functional programming paradigm. The Day Page features a large thumbnail of a document or photo and a smaller calendar grid, with travel information for the day displayed below it if present, followed by the activities and documents for that day.

fig. 1 Day Page view


Another page is then created with larger images of the document, additional pages if present, and some metadata about the document. As most of the output pages have a similar look and feel, there is a set of templates that handle the calendar navigation menu. Depending on what context the navigation is needed for (Year Page, Month Page, etc.), the output can adjust accordingly.

Computing the Calendar Grids

An important part of the look and feel of the project from early in the planning stage was to have familiar and user-friendly navigation for the large site, which meant that the traditional calendar grid would play a big role.


fig. 2 Calendar Grid example, including travel

However, the data that we’ve amassed at the day level doesn’t have any information like the day of the week.  Beyond the 7 days of the week that January 1 could fall on, the grids for each month are complicated by whether or not a year is a leap year, resulting in 14 combinations that don’t occur in a regular pattern. If you’ve ever looked at a perpetual calendar, you’ll know that the grids are deceptive in how simple they look.

An initial design goal was to have the calendar stylesheet be able to handle all of the grid variations with only minimal help. We didn’t really want to have to prepare and store lots of information for each month of each year that the project would encompass. For each year, all the stylesheet needs is the day of the week that January 1 falls on (a number 1-7, representing Sunday-Saturday in our case), which is stored in the year XML, and then it can take that and the year and figure out all of the month grids. Fitting the algorithm for this into the XSLT stylesheet code is one of the more complex coding projects we’ve worked on.
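For readers curious about the algorithm, here is a Python sketch of the same idea (the real implementation is several hundred lines of XSLT; the 1-7 weekday numbering for Sunday-Saturday matches the scheme described above):

```python
def is_leap(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def month_grids(year, jan1_weekday):
    """Build a weekday-aligned grid for each month, given only the year
    and the weekday of January 1 (1-7 = Sunday-Saturday), as the
    calendar stylesheet does."""
    lengths = [31, 29 if is_leap(year) else 28, 31, 30, 31, 30,
               31, 31, 30, 31, 30, 31]
    grids = []
    weekday = jan1_weekday  # weekday of the 1st of the current month
    for days in lengths:
        # Pad the first week with blanks so day 1 lands on its weekday.
        cells = [None] * (weekday - 1) + list(range(1, days + 1))
        cells += [None] * (-len(cells) % 7)          # pad the last week
        grids.append([cells[i:i + 7] for i in range(0, len(cells), 7)])
        weekday = (weekday - 1 + days) % 7 + 1       # roll into next month
    return grids

# January 1, 1940 fell on a Monday (weekday 2 in our scheme).
grids = month_grids(1940, 2)
print(grids[0][0])   # first week of January 1940 → [None, 1, 2, 3, 4, 5, 6]
```

From a single stored number per year, every month grid for that year (leap years included) falls out of the loop above.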

It took several hundred lines of code, but we haven’t needed to mess with it since it was first written, even as we’ve added years and expanded the project. With the help of new features in XSLT version 2 and several more years of experience, the code could be rewritten to be cleaner and more efficient. However, because it was so reliable, time on stylesheet development was spent elsewhere.


fig. 3 Travel summary display for 1954

Travel Summary Display

In the calendar XML data, each day can contain a basic <travel> element that states where the Paulings were on that day. This information is gleaned mostly from travel itineraries and speaking engagements, but also from correspondence or manuscript notes, and usually represents a city. On the Year page, we wanted to present a nice summary for the year of where the Paulings traveled to, and how long their trips were. Because the data was spread across days, and not already grouped together, it was a challenge at first to get trip totals that were always accurate and grouped appropriately. Using the new grouping features in XSLT version 2, the various trips could be grouped together appropriately, and then displayed in order. A range could be computed using the first and last entry of the group, and linked to the day page for the first entry. Now if you see an interesting place that the Paulings visited, you can go directly to the first day they were there, and see what they were doing that day. If more than one day was spent in a given place, a total is displayed showing how many days they were there.
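The grouping logic can be sketched in a few lines of Python; the travel data below is hypothetical, but the structure (one place per day) matches what the calendar XML stores:

```python
from itertools import groupby

# Hypothetical day-level travel data: (date, place) pairs, one per day.
days = [
    ("1954-04-12", "Urbana, Illinois"),
    ("1954-04-13", "Urbana, Illinois"),
    ("1954-04-14", "Urbana, Illinois"),
    ("1954-06-01", "Zurich, Switzerland"),
    ("1954-12-10", "Stockholm, Sweden"),
    ("1954-12-11", "Stockholm, Sweden"),
]

def trips(travel_days):
    """Group adjacent days spent in the same place into trips, analogous
    to XSLT 2.0's group-adjacent grouping."""
    summary = []
    for place, group in groupby(travel_days, key=lambda d: d[1]):
        dates = [d[0] for d in group]
        summary.append({"place": place, "first": dates[0],
                        "last": dates[-1], "days": len(dates)})
    return summary

for trip in trips(days):
    print(trip["place"], trip["first"], "-", trip["last"],
          "({} days)".format(trip["days"]))
```

The first date of each group is what the Year page links to, so a reader can jump straight to the first day of a trip.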

METS for Documents and Images

The last post covered what METS records are and how we are using them. Because the files are fairly complex and have additional data that we aren’t using, the calendar stylesheet abstracts away the unneeded info and complexity. A temporary data structure is used to store the data needed, and then the calendar stylesheet refers to that in its templates, instead of dealing directly with the METS record and the descriptive MODS metadata. This approach is also used in the stylesheets for our Documentary History websites, and portions of the code were able to be repurposed for the calendar stylesheet.

Transcripts

As covered in the last post, the transcripts are stored in individual TEI Lite XML files. In our calendar year XML data, a simple <transcript> element added to a listing conveys the ID of the transcript file. The stylesheet can then take this ID, retrieve the file, apply the formatting templates, and output the result to the HTML pages. We use XML Namespaces to keep the different types of source documents separate, so the XSLT stylesheet can apply formatting to only a specific one. If we wanted to apply some styling to the title element, for example, we could make sure that it was only the title element from a TEI file, not from a METS file. This allows us to keep the formatting templates for TEI Lite files in a separate XSLT file, which can be imported by the calendar stylesheet, and none of its rules and templates will affect anything but the TEI Lite files. Since the code for the TEI Lite files and transcripts was already written for earlier projects, very little stylesheet code (less than 10 lines) was needed to add the ability to display nicely formatted transcripts.
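A small Python illustration of the principle, using namespace-qualified selection in the standard library (the namespace URIs shown are the standard TEI and METS ones, and the sample document is contrived):

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
METS = "http://www.loc.gov/METS/"

# Two <title> elements that share a local name but live in different
# namespaces, as our TEI and METS sources do.
doc = ET.fromstring(
    '<root xmlns:tei="http://www.tei-c.org/ns/1.0" '
    'xmlns:mets="http://www.loc.gov/METS/">'
    '<tei:title>A transcript title</tei:title>'
    '<mets:title>A METS object title</mets:title>'
    '</root>'
)

# Selecting by namespace-qualified name touches only the TEI titles,
# just as an XSLT template matching tei:title leaves mets:title alone.
tei_titles = doc.findall("{%s}title" % TEI)
mets_titles = doc.findall("{%s}title" % METS)
print([t.text for t in tei_titles])
print([t.text for t in mets_titles])
```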

The Building Blocks of Linus Pauling Day-by-Day


fig. 1 The technical workflow of Linus Pauling Day-by-Day

Any given page in the Linus Pauling Day-by-Day calendar is the product of up to four different XML records.  These records describe the various bits of data that comprise the project – be they document summaries, images or full-text transcripts.  The data contained in the various XML records are then interpreted by XSL stylesheets, which redistribute the information and generate local HTML files as their output.  The local HTML is, in turn, styled using CSS and then uploaded to the web.

Got all that?

In reality, the process is not quite as complicated as it may seem.  Today’s post is devoted to describing the four XML components that serve as building blocks for the calendar project.  Later on this week, we’ll talk more about the XSL side of things. (For some introductory information on XML and XSL, see this post, which discusses our use of these tools in creating The Pauling Catalogue.)

Preliminary work in WordPerfect

The 68,000+ document summaries that comprise the meat of Linus Pauling Day-by-Day have been compiled by dozens of student assistants over the past ten years.  Typically, each student has been assigned a small portion of the collection and is charged with summarizing, in two or three sentences, each document contained in their assigned area.  These summaries have, to date, been written in a series of WordPerfect documents.

The January 30, 2009 launch of Linus Pauling Day-by-Day is being referred to, internally, as Calendar 1.5.  This is in part because of several major workflow changes that we have on tap for future calendar work, a big part of it being a movement out of WordPerfect.  While the word processing approach has worked pretty well for our students – it’s an interface with which they’re familiar, and includes all the usual advantages of a word processing application (spellchecking, etc.) – it does present fairly substantial complications for later stages of the work flow.

For one, everything has to be exported out of the word processing documents and marked up in XML by hand.  For another, the move out of a word processor and into XML often carries with it issues related to special characters, especially “&,” “smart quotes” and em dashes, all of which can play havoc with certain XML applications.
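A minimal sketch of the kind of cleanup pass involved, assuming a simple replace-then-escape approach (the character list here covers just the common offenders named above):

```python
from xml.sax.saxutils import escape

# Straighten smart quotes, normalize em dashes, and escape raw
# ampersands so text pasted from a word processor is safe to drop
# into an XML document.
REPLACEMENTS = {
    "\u201c": '"',   # left smart double quote
    "\u201d": '"',   # right smart double quote
    "\u2018": "'",   # left smart single quote
    "\u2019": "'",   # right smart single quote
    "\u2014": "--",  # em dash
}

def clean_for_xml(text):
    for smart, plain in REPLACEMENTS.items():
        text = text.replace(smart, plain)
    return escape(text)  # turns & into &amp;, < into &lt;, etc.

print(clean_for_xml("Ava Helen \u201cand\u201d Linus \u2014 AH&LP"))
```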

Our plan for Calendar 2.0 is to move out of a word processing interface for the initial data entry in favor of an XForms interface, but that’s fodder for a later post.

The “Year XML”

Once a complete set of data has been compiled in WordPerfect, the content is then moved into XML.  All of the event summaries that our students write are contained in what might be called “Year XML” records.  An example of the types of data that are contained in these XML files is illustrated here in fig. 2.  Note that the information in the fig. 2 slide is truncated for display purposes – all of the “—-” markers represent text that has been removed for sake of scaling the illustration – but that generally speaking, the slide refers to the contents of the January 2, 1940 and August 7, 1940 Day-by-Day pages, the latter of which will also serve as our default illustrations reference.

Cursory inspection of the “Year XML” slide reveals one of the mark-up language’s key strengths – its simplicity.  For the most part, all of the tags used are easily understandable to humans, and the tag hierarchy that organizes the information follows a rather elementary logic.  The type of record is identified using <calendar>, months are tagged as <month>, days are tagged as <day> and document summaries are tagged as <event>.

The one semi-befuddling aspect of the “Year XML” syntax is the ID system used in reference to illustrations and transcripts.  After much experimentation, we have developed an internal naming system that works pretty well in assigning unique identifiers to every item in the Pauling archive.  The system is primarily based upon a document’s Pauling Catalogue series location and folder identifier, although since the Catalogue is not an item-level inventory (not completely, anyway) many items require further description in their identifier.  In the most common case of letters, the further description includes identifying the correspondents and including a date.

Fig. 2 provides an example of three identifiers.  The first is <record><id series="09">1940i.38</id></record>, which is the “Snapshot” reference for the 1940 index page.  This identifier is relatively simple as it defines a photograph contained in the Pauling Catalogue Photographs and Images series (series 09), the entirety of which is described on the item level.  So this XML identifier utilizes only a series notation (“09”) and a Pauling Catalogue notation (1940i.38).

The two other examples in Fig. 2 are both letters.  The first is <record><id series="01">corr136.9-lp-giguere-19400102</id></record>, a letter from Linus Pauling to Paul Giguere located in the Pauling Catalogue Correspondence series (series 01) in folder 136.9, and used on Day-by-Day as the illustration for the first week of January, 1940.  Because the folder is not further described on the item level, there exists a need for more explication in the identifier of this digital object.  Hence the listing of the correspondents involved and the date on which the letter was written.

The second example is a similar case: <record><id series="11">sci14.038.9-lp-weaver-19400807</id></record>, used as the Day-by-Day illustration for the first full week of August 1940.  In this instance, however, the original document is held in Pauling Catalogue series 11, Science, and is a letter written by Pauling to Warren Weaver on August 7, 1940.
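To make the convention concrete, here is a hypothetical Python parser for these identifiers; the field labels (“folder”, “sender” and so on) are our own shorthand, not part of the actual naming scheme:

```python
# Parse identifiers of the form <folder-id>-<sender>-<recipient>-<yyyymmdd>,
# falling back to a bare catalogue notation for item-level series.
def parse_identifier(ident):
    parts = ident.split("-")
    if len(parts) == 1:
        # Item-level series need only the catalogue notation, e.g. 1940i.38.
        return {"folder": parts[0]}
    folder, sender, recipient, date = parts
    return {"folder": folder, "sender": sender, "recipient": recipient,
            "date": date[:4] + "-" + date[4:6] + "-" + date[6:]}

print(parse_identifier("sci14.038.9-lp-weaver-19400807"))
print(parse_identifier("1940i.38"))
```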

METS Records to Power the Illustrations

We’ve talked about METS records a few times in the past, and have defined them as “all-in-one containers for digital objects.”  The Pauling to Weaver illustration mentioned above is a good example of this crucial piece of functionality, in that it is used as a week illustration in the August 1940 component of the Day-by-Day project, and is also a supporting document on the “It’s in the Blood!” documentary history website.  Despite its dual use, the original document was only ever scanned once and described once in METS and MODS.  Once an item is properly encoded in a METS record, it becomes instantly available for repurposing throughout our web presence.

Just about everything that we need to know about a scanned document is contained in its METS record.  In the case of Day-by-Day, we can see how various components of the Pauling to Weaver METS record are extracted to display on two different pages of the project.  Fig. 3 is a screenshot of this page, the “Week Index View” for the August 7, 1940 Day-by-Day page (all of the days for this given week will display the same illustration, but will obviously feature different events and transcripts listings).  Fig. 4 is a screenshot of the “Full Illustration View,” wherein the user has clicked on the Week Index illustration and gained access to both pages of the letter, as well as a more-detailed description of its contents.

Below (fig. 5) is an annotated version of the full METS record for the Pauling to Weaver letter.  As you’ll note once you click on it, fig. 5 is huge, but it’s worth a look in that, among other details, it gives an indication of how different components of the record are distributed to different pages. For example:

  • The Object Title, “Letter from Linus Pauling to Warren Weaver,” which displays in both views.
  • The Object Summary, “Pauling requests that Max Perutz…,” which displays only in the Full Illustration View.
  • The Object Date, used in both views.
  • The Local Object Identifier, sci14.038.9-lp-weaver-19400804, which displays at the bottom of the Full Illustration View.
  • The Page Count, used only in the Week Index View.
  • Crucially, the 400-pixel-wide JPEG images, which are stored in one location on our web server (corresponding, again, with Pauling Catalogue series location), but in this example retrieved for display only in the Week Index View.
  • And likewise, the 600-pixel-wide JPEG images, which are retrieved for Day-by-Day display in the Full Illustration View, but also used for reference display in the Documentary History projects.
fig. 5 An annotated version of the full METS record for digital object sci14.038.9-lp-weaver-19400804


An additional word about the illustrations used in Linus Pauling Day-by-Day

One of the major new components of the “1.5 Calendar” launch is full-page support for illustrations of ten pages or fewer – in the 1.0 version of the project, only the first page of each illustration was displayed, no matter the length of the original document.  Obviously this is a huge upgrade in the amount and quality of the content that we are able to provide from within the calendar.  The question begs to be asked, however: “why ten pages or fewer?”

In truth, the ten-page rule is somewhat arbitrary, but it works pretty well in coping with a major scaling problem that we face with the Day-by-Day project.  Users will note that the “Full Illustration View” for all Day-by-Day objects presents the full page content (when available) on a single HTML page, as opposed to the cleaner interface used on our Documentary History sites.  There’s a good reason for this.  In the instance of the Documentary History interface, essentially two HTML pages are generated for every original page of a document used as an illustration: a reference view and a large view.  This approach works well for the Documentary History application, in that even very large objects, such as Pauling’s 199-page long Berkeley Lectures manuscript, can be placed on the web without the size of a project exploding out of control – the Berkeley Lectures comprise 398 HTML pages, which is a lot, but still doable.

Linus Pauling Day-by-Day, on the other hand, currently requires that the full complement of images theoretically comprising an illustration be used, specifically, for each unique day of the week for which an image is chosen.  In other words, if the Berkeley Lectures were chosen to illustrate a week within the calendar, and the full content of the digital object were to be displayed for each day of that week using the same clean interface as a Documentary History, a sum total of 2,786 (199 × 2 × 7) HTML pages would need to be generated to accomplish the mission.  For that week only.  Obviously this is not a sustainable proposition. By contrast, the current version 1.5 approach always requires 7 HTML pages for each week, though full image support and super-clean display are sometimes sacrificed in the process.

Calendar 2.0 will deal with the issue using a database approach, but again, this is a different topic for a different time.

Last but not least, TEI Lite

We’ve discussed TEI Lite in the past as well and will not spend a great deal of time with it here, except to reiterate that it is a simple mark-up language that works well in styling full-text transcripts and other similar documents for the web.

There are nearly 2,000 TEI Lite documents included in Linus Pauling Day-by-Day, virtually all of them transcripts of letters sent or received by Linus Pauling.  Transcript references within the Year XML are illustrated in fig. 2 above – they follow the exact same naming convention as our METS records, except that the mets.xml suffix is replaced by tei.xml.  It is worth noting that rough drafts of most of the text that was eventually encoded in TEI for the Day-by-Day project were generated using OCR software.  And while OCRing has improved mightily over the years, it still does have its quirks, which is why some of you might find, for example, the lower-case letter l replaced by the number 1 in a few of the transcripts currently online. (We’re working on it.)

The TEI Lite mark-up for the Pauling to Weaver letter is illustrated in fig. 6, as is, in fig. 7, the encoding for the biographical chronology (written by Dr. Robert Paradowski) used on the 1940 index page.  Note, in particular, the use of <div> tags to declare a year’s worth of information in the Paradowski mark-up.  These tags were included as markers for the XSL stylesheet to pull out a given chunk of data to be placed on a given year’s index page.  The entire Paradowski chronology will be going online soon, and once again that project, as with Day-by-Day, will be generated from only this single record.

fig. 6 The TEI Lite mark-up for Pauling's August 7, 1940 letter to Warren Weaver.


fig. 7 The TEI Lite mark-up for one year of the Paradowski chronology


Custom XML, METS records and TEI Lite – these are the building blocks of Linus Pauling Day-by-Day.  Check back later this week when we’ll discuss the means by which the blocks are assembled into a finished website using XSL stylesheets.

A New TEI Lite Project: The Pauling Student Learning Curriculum

This past Friday we launched a new project about which we’re pretty excited.  As described in this press release, the Pauling Student Learning Curriculum is geared toward advanced high school- and college-age students, and is applicable to the teaching of both history and science.  As the press release also notes, the large amount of illustrative and hyperlinked content included in the website makes this a resource that should be useful to teachers and students anywhere in the world.

The history of this project is a long and interesting one.  The curriculum itself was originally designed nearly ten years ago for use by visiting fellows of the Linus Pauling Institute.  Over time, the content that was developed for the fellows program was repurposed for use by a University Honors College chemistry class that conducts research on the Pauling legacy every Winter term.  For several years we’ve been planning to post the text of the curriculum online, thinking that doing so would assist those chemistry students whose busy schedules preclude their spending an optimum amount of time in the Special Collections reading room.  It eventually dawned on us that the curriculum could actually be expanded into a powerful resource for use by teachers well-beyond the Oregon State University campus, and we’ve been developing the project with that goal in mind ever since.

The bulk of the curriculum is devoted to an abbreviated survey of Pauling’s life and work, presented in chronological order, and grouped under the following headings:

Throughout these sections, we’ve linked to any applicable objects that have already been digitized in support of our various Documentary History and Primary Source websites.

The curriculum also includes a series of instructions on “rules for research” in an archive.  We feel that this is especially important given the youth of our target audience, and hope that it will likewise provide a positive introduction to the ins and outs of conducting scholarship with primary sources — an oftentimes intimidating process for researchers at any level.

The website itself is built with TEI Lite, which we’re using more and more in support of small but clean webpages that can be created and released comparatively quickly.  Though we’ve used the TEI (Text Encoding Initiative) Lite standard for numerous transcript projects in the past, the first of our sites to be built entirely in TEI Lite was the biographical essay “Bernard Malamud: An Instinctive Friendship,” written by Chester Garrison and posted on our Bernard Malamud Papers page last month.  Plans for several additional TEI Lite-based “microsites” are currently in the works.

TEI Lite is a terrific tool in part because it is very simple to use.  In the example of the curriculum, all of the text, images, administrative metadata and much of the formatting that appears on the finished site are encoded in easily learned and interpreted tags.  (We used XSL to generate the table of contents and to standardize the page formatting — e.g., where the images sit on a page and how the captions render.)  As a result, most of the mark-up required for these projects is at least roughed out by our student staff, which makes for a pretty efficient workflow within the department.
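For readers unfamiliar with the standard, a generic TEI Lite page (a hypothetical sketch, not our actual curriculum files) might look something like this:

```xml
<div type="section" n="2">
  <head>Pauling's Early Years</head>
  <p>Linus Pauling was born in Portland, Oregon, in 1901 ...</p>
  <!-- An image and its caption, encoded alongside the text -->
  <figure>
    <graphic url="images/pauling-portrait.jpg"/>
    <head>A hypothetical caption rendered beneath the image</head>
  </figure>
</div>
```

The XSL stylesheet then decides how each of these elements renders on the finished page.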

An example of the TEI Lite code for Page 2 of the Pauling Student Learning Curriculum is included after the jump.  We’ll be happy to answer any reader questions in the Comments to this post.


Creating The Pauling Catalogue: Page Design

Example text and image from The Pauling Catalogue scrapbooks series

[Part 7 of 9]

Once the publication’s text had been encoded and its illustrations selected, the next major challenge in creating The Pauling Catalogue was the actual design of the publication, page-by-page and volume-by-volume.  This process was carried out chiefly through the skillful use of Adobe’s InDesign software.

With the raw text of the publication marked up in XML, the catalogue’s textual content was ready to import into InDesign.  The result of this import, however, was a large mass of largely unformatted text.  As much as possible, various characteristics were assigned to groups of text based upon a given group’s location in the XML hierarchy: specific sets of data were styled automatically through a pre-determined set of formatting rules specifying font, color and spacing.

The illustration below is an example of the output generated by this process.  The top level of the hierarchy for this series is the series itself, “Biographical.”  The second level of the hierarchy is the subseries, in this case “Personal and Family.”  The third level is a box title, and the fourth level is a folder title.  The illustration depicts the styling characteristics that were assigned to levels three and four of the hierarchy in the Biographical series — all box titles were formatted in red, all folder titles were formatted in black, and each had its own spacing rules.

An example of the styling output in Biographical: Personal and Family
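In XML terms, the four-level hierarchy described above might be sketched like this (the element names and titles are illustrative, not the project’s actual schema):

```xml
<series title="Biographical">
  <subseries title="Personal and Family">
    <!-- Level three: box titles, styled red by the InDesign formatting rules -->
    <box title="Correspondence, 1919-1925">
      <!-- Level four: folder titles, styled black with their own spacing -->
      <folder title="Letters from Ava Helen Pauling"/>
      <folder title="Letters to family members"/>
    </box>
  </subseries>
</series>
```

A single set of styling rules keyed to each level of such a hierarchy can then format every box and folder title in a series consistently, with no hand-styling required.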

Formatting the publication’s illustrations was a significantly more complex proposition.  During its initial design phase, a placeholder image template was inserted on each page of The Pauling Catalogue.  These templates consisted of three “boxes” meant to hold printing material — one box for the illustration, one box for the illustration’s catalogue identification number and one box for its caption — as well as two additional “boxes” in which non-printing design notes (an instructional note for the designer indicating where the image should be located on a given page, and an abbreviated description of the image used to generate the illustration indices published at the back of each volume) were inserted.  Identifier and caption information for each illustration was imported directly into these image templates from a series of master Excel spreadsheets.  For pages not containing an illustration, the placeholder templates were later removed.

A representation of the multiple "boxes" comprising the image template used to format each of the 1,200+ illustrations incorporated into The Pauling Catalogue.

A great deal of image correction was likewise conducted to remove flaws — dust speckles, for instance — from the selected illustrations.

A particularly extreme example of the image correction often required in the formatting of the illustrations used in The Pauling Catalogue.

A few original graphics were also created for this project, most notably the Pauling Catalogue badge designed for presentation on the cover of each volume.

The Pauling Catalogue badge and the Pauling Papers logo -- both are used as design elements throughout The Pauling Catalogue.

Manual corrections were made to minimize “widows” and “orphans,” and a few additional manual changes were made directly in InDesign to correct small problems that would not be efficient to address in XML or XSL.

The Pauling Catalogue

All design decisions were made with the overarching, two-pronged goal of this project kept in mind: 1) to disseminate scholarly information in a clean and usable manner and 2) to create a product that is aesthetically pleasing, browsable and of interest to a broad audience.  While the primary market for The Pauling Catalogue is presumed to be academic libraries and history departments, we feel that the finished product is likewise at home on the coffee table or living room bookshelf.

The Pauling Catalogue is available for purchase at http://paulingcatalogue.org

Creating The Pauling Catalogue: Typography and Proofreading

A sample of the Pauling Catalogue page layout. Vol 1, pg. 47

[Part 6 of 9]

As work on The Pauling Catalogue moved further in the direction of what would become the finished product, one surprisingly difficult set of decisions concerned the typography of the set’s 1,700+ pages.

After much research, two typefaces – Palatino Linotype and Myriad Pro – and ten fonts were purchased for use in the publication.  The purchase of these multiple font options was prompted by the need for a vast library of “special characters” (e.g. certain scientific symbols and non-Roman alphabetic characters) for use throughout the project.  As mentioned earlier in this series, coping with the challenges presented by special characters was in part enabled by the use of XML.

XSL was likewise enlisted in the battle against rogue special characters.  The illustration below depicts, in part, the report generated by a custom XSL transform written to search for unsupported characters in the hundreds of thousands of lines of XML code that comprise The Pauling Catalogue dataset.  Characters for which no supporting font library could be found were displayed in the report as the symbols highlighted in pink.  This XSL-based approach worked effectively in identifying problematic areas of text within draft versions of the six-volume set.

An example of the XSL reports used to locate "missing" special characters.  Inset are two examples of special characters created by hand.
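A transform of this kind can be quite compact.  The sketch below is a guess at the general approach, not the project’s actual stylesheet (the element name and the supported-character list are hypothetical): XSLT 1.0’s translate() function strips every known-good character from a text node, and anything left over is flagged in the report:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Abbreviated list of characters covered by the purchased font libraries -->
  <xsl:variable name="supported">ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,;:()'-</xsl:variable>

  <xsl:template match="folder">
    <!-- translate() removes every supported character; any leftovers are suspect -->
    <xsl:variable name="leftover" select="translate(., $supported, '')"/>
    <xsl:if test="$leftover != ''">
      <problem title="{@title}">
        <xsl:value-of select="$leftover"/>
      </problem>
    </xsl:if>
  </xsl:template>

  <!-- Suppress all other text so only flagged characters appear in the report -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

Run over the full dataset, a report like this reduces a hunt through hundreds of thousands of lines to a short list of offending characters and their locations.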

It is worth noting too that, even with two typefaces and ten fonts working on his behalf, the project team’s graphic designer was still, in a few instances, forced to create certain symbol glyphs by hand.  Two such examples are spotlighted above — the Georgian letter “vin” and the scientific “double-arrow” symbol representing a system in equilibrium.

A great deal of proofreading had already gone into the catalogue files as a result of nearly two decades’ worth of editing and spellchecking in WordPerfect.  Six local drafts of The Pauling Catalogue prototype were printed out on the OSU campus over the eighteen months that the editorial staff spent developing and refining the project.  Each of these drafts was line-edited by the indispensable Special Collections student staff, with special attention paid to anomalies caused by special characters.  An example of the reams of notes that the students compiled is included here.

Our students did an outstanding job of proofing the six local drafts.

Image captions, page headers and prefatory materials were closely reviewed by the editorial staff.  That said, a few big mistakes nearly made their way into the finished project.  Can you spot the error below?  We didn’t until our review of the project bluelines – the last possible point at which changes could realistically be made!

The Pauling Catalogue
The Pauling Catalogue is available for purchase at http://paulingcatalogue.org/

Creating The Pauling Catalogue: Formatting Text with XML and XSL

A depiction of the text formatting cycle used in the creation of The Pauling Catalogue

[Part 4 of 9]

One of the earliest and most pressing questions that the project team had to answer in constructing The Pauling Catalogue was how to go about formatting the text of such a massive document.  The catalogue had been generated over many years as a series of WordPerfect word processing documents.  While the word processing interface worked nicely in developing working documents, moving the catalogue data out of WordPerfect and into a flexible format more suitable to a professional printing operation was a significant challenge.

Ultimately it was decided to format the text data using Extensible Markup Language (XML) and Extensible Stylesheet Language Transformations (XSLT).

XML is a markup language that adds machine-readable structure to existing data.  Using a series of tags applied hierarchically throughout a given data set, XML greatly enhances one’s ability to manipulate data in useful, uniform ways.  This manipulation of XML-encoded data is implemented using XSLT.  In a nutshell, XSL transformations consist of sets of rules that locate specific pieces of data and then either present those pieces in a prescribed order or suppress them entirely.

The seventeen series that make up The Pauling Catalogue were each encoded in XML and manipulated – sometimes subtly and sometimes severely – using XSLT. The importance of this process to the creation of the end product is difficult to overstate. A perfect illustration of the power of XML and XSLT is provided by the Pauling Personal Library series in Volume 6 of The Pauling Catalogue.  Linus and Ava Helen Pauling’s personal library contains over 4,000 volumes and the published bibliography of all these items is 178 pages long. The XML mark-up for each book is shown here:

The XML encoding schema for one of the 4,000+ books in the Pauling Personal Library
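The project’s actual encoding appears in the illustration; as a rough textual stand-in, a record of this general shape (element names and values hypothetical) conveys the idea:

```xml
<!-- One book from the personal library, with author, title and imprint
     broken out into separate, machine-sortable elements -->
<book callnumber="QD461 .P3 1939">
  <author>
    <last>Pauling</last>
    <first>Linus</first>
  </author>
  <title>The Nature of the Chemical Bond</title>
  <imprint>
    <publisher>Cornell University Press</publisher>
    <year>1939</year>
  </imprint>
</book>
```

Because every piece of bibliographic data sits in its own element, any of them can serve as a sort key later on.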

When the personal library was originally encoded for display on the web, all of the volumes that make up the series were arranged according to Library of Congress classification number.  As the details of The Pauling Catalogue publication were being determined, a decision was made that the books in the Personal Library would be more useful to users of a paper reference if each item were presented alphabetically by author’s last name.

Carrying out this re-sort process by hand would have taken a very long time, as each book listing would have required human “cutting and pasting” intervention to reorganize the records from call number order to alphabetical order.  However, because the content of the personal library had been described in XML, a machine-readable format, a series of new XSLT rules was instead used to automate the re-sort:

A series of rules written in XSL was used to re-sort the Pauling Personal Library arrangement.
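The heart of such a re-sort is a single pair of xsl:sort instructions.  A minimal sketch, assuming hypothetical &lt;book&gt; records containing author/last and author/first elements, might read:

```xml
<xsl:template match="library">
  <!-- Emit the books alphabetically by author surname, then forename,
       instead of in document (call number) order -->
  <xsl:apply-templates select="book">
    <xsl:sort select="author/last" data-type="text" order="ascending"/>
    <xsl:sort select="author/first" data-type="text" order="ascending"/>
  </xsl:apply-templates>
</xsl:template>
```

Swapping the sort key is a one-line change, which is precisely the flexibility that hand-arranged word processing files lack.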

Consequently, a process that would have taken many days, if not weeks, to conduct “by hand” was instead completed with a few hours of nimble XSLT coding.  The resulting differences from the version 2 proof to the version 6 proof of The Pauling Catalogue are immediately apparent:

From draft version 2 to draft version 6 of the publication, significant arrangement changes were made to the Pauling Personal Library

Another major benefit of XML is the standard’s support for special characters.  When developing content in HTML, web authors have traditionally been required to describe special characters (e.g. scientific symbols or non-Roman alphabetic characters) using character entities.

For example, if an author wished to insert a subscript number 2 into their text, HTML would require the character entity &#8322; to display the symbol in a web browser.  XML, on the other hand, uses tags that are both human- and machine-readable to describe and format a subscript 2. (see illustration below)

The situation is similar for symbols such as an arrow:  HTML requires the character entity &#8594;, while XML “understands” and will output an arrow symbol entered into a properly-formed XML document.  This enhanced support for special character encoding was terrifically helpful in the formatting of the Pauling Research Notebooks series, which contains a great number and variety of special characters:

An example of the special characters mark-up used in the Pauling Research Notebooks series. XML's support of special characters encoding is significantly more intuitive and elegant than are the character entity requirements specified by HTML.
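To make the contrast concrete, here is the same subscript 2 expressed both ways (the &lt;sub&gt; element name is a stand-in for whatever the project schema actually used):

```xml
<!-- HTML: the subscript hides behind an opaque numeric character entity -->
<p>H&#8322;O</p>

<!-- XML: the same content in readable, descriptive mark-up -->
<formula>H<sub>2</sub>O</formula>
```

An XSL stylesheet can then render the marked-up version however the output medium requires, whether for the web or for print.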

The Pauling Catalogue

While XML and XSLT provided a strong platform for the formatting of The Pauling Catalogue text, the 1,200+ illustrations inserted throughout the six-volume publication presented a new and varied set of challenges.  The processes required to cope with these issues will be the subject of our next post in this series.

The Pauling Catalogue is available for purchase at http://paulingcatalogue.org