SCARC Has a New Virtual Home

We reported on something similar nearly three years ago, and we’re happy to announce again today that the department behind the Ava Helen and Linus Pauling Papers – the Special Collections & Archives Research Center (SCARC) – now has a new and improved web presence. As first revealed yesterday on our sister blog, Speaking of History:

It is with great pleasure that we announce the official launch of our new department website!  Please find it at http://scarc.library.oregonstate.edu and be sure to update your bookmarks away from the old University Archives and Special Collections sites, which will no longer be maintained…

(Click here to read the whole blog post, which offers a tour of all that is new and exciting on the site.)

The SCARC website took nearly ten months to complete from its initial conception to today’s reality, traveling through many different versions along the way.

The version 1 schematic of the proposed SCARC website.

The project was initiated out of a need to present the contents of the former Special Collections and University Archives websites in one spot, the result of the two departments having administratively merged in September 2011. While seemingly a daunting task, the process was made much easier by the fact that both departments had committed to Encoded Archival Description (EAD) as the platform on which archival collections would be prepared and published for consumption on the web.

Though the combined department serves as caretaker of 1,034 collections (and counting), this commitment to a uniform descriptive practice, in tandem with xsl programming skill within the department, allows now for the generation and presentation of more than 900 finding aids in a relatively painless manner.  Certain of the finding aids suffer from wonky encoding, but most of the description is currently presented in a clean and user-friendly fashion, and we’re working to clean up the few troublesome collections that remain.

A glimpse at a few collections from the M alphabetical sort.

The interplay between xml and xsl also allowed us to “tag” each collection with genres and themes, which could then be output as static html pages available to users who wish to browse the collections by a specific orientation.  In other words, if one is interested in viewing only collections containing photographs, that option is available. The same is true for collections containing audio, video or books, as well as collections focusing on university history, the history of science, natural resources, multiculturalism or local history.  And from the perspective of content description, the “tagging” process could not have been simpler.  The screenshot below provides a glimpse of the collections.xml file that guides the theme and genre sort, as well as the image and caption that illustrates each collection.  The simplicity of what you see is just another example of the power of xml (at least when placed in the hands of a strong xsl programmer).
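As a rough illustration of the tagging idea, here is a minimal Python sketch (our production work is done in XSL, not Python). The element and attribute names in the sample XML are hypothetical stand-ins, since the real collections.xml layout is only visible in the screenshot, and the second collection title is invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stand-in for collections.xml; the real file's element and
# attribute names are not reproduced in this post.
COLLECTIONS_XML = """
<collections>
  <collection id="pauling" title="Ava Helen and Linus Pauling Papers">
    <genre>photographs</genre>
    <genre>audio</genre>
    <theme>history-of-science</theme>
  </collection>
  <collection id="sample" title="Example Campus Photographs">
    <genre>photographs</genre>
    <theme>university-history</theme>
  </collection>
</collections>
"""

def browse_index(xml_text, tag):
    """Map each genre or theme value to the titles of collections tagged with it."""
    index = defaultdict(list)
    for coll in ET.fromstring(xml_text).iter("collection"):
        for node in coll.findall(tag):
            index[node.text.strip()].append(coll.get("title"))
    return dict(index)

# Each key would become one static browse page listing its collections.
by_genre = browse_index(COLLECTIONS_XML, "genre")
by_theme = browse_index(COLLECTIONS_XML, "theme")
```

Adding a collection to a browse page then amounts to adding one more tag to its entry in the file.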

The new SCARC site also made great use of the model presented by Linus Pauling Online in developing a series of portals that provide access to the great swath of digital content generated by Special Collections and University Archives over time. Beginning in the mid-1990s, both units began digitizing in mounting volume to the point where, in 2011, a huge amount of content was available 24/7 to many different types of user groups…if they were able to find it.  The new site organizes all of this digital content into collection development-specified portals meant to improve ease of access.

Linus Pauling Online

This approach was based on the example of Linus Pauling Online, a sleekly designed landing page for all of the Pauling-related digital content that had been generated by Special Collections over more than ten years’ worth of dedicated work. The beauty of Linus Pauling Online, which launched in January 2009, is its ability to provide “one stop shopping” for anyone interested in Pauling who happened upon our web space. Now, a similar arrangement is offered to users seeking out digitized content related to the four primary collection development themes lying at the heart of SCARC’s mission.

The Oregon Multicultural Archives Digital Resources page.

The finalization of the SCARC website marks a major step forward for the department as it allows us now to present different types of intellectual content in a single space and using uniform tools.  In the near term, look for more digitized videos related to the history of Oregon State University or the Pacific Northwest.  Look also for more digitized and encoded manuscripts as well as deeper description of collections of all kinds.

Within SCARC the Pauling Papers are now one of 1,034 collections.  But fans of Pauling needn’t worry that his life and work will get lost in the shuffle.  Of SCARC’s ten professional staff, two FTE are devoted to history of science collections and plenty of ideas are in the works for new projects emanating out of the Pauling Papers.  Not least of these is the Pauling Blog, which will continue to provide weekly outreach related to history’s only recipient of two unshared Nobel Prizes.

An Expanded and Improved Pauling Awards Site

Pauling receiving the Priestley Medal, 1984.

It is with great pleasure that we announce the release of a revised and expanded version of the website Linus Pauling: Honors, Awards and Medals. This new iteration of the site includes well over 600 images of nearly all of the 460 awards that Pauling received over the course of his 70+ years in science (as well as the nine awards that he was given after he died).

Indeed, Pauling was a well-decorated individual, the recipient of 47 honorary doctorates and just about every important award that a scientist can get.  He started early: in 1931 he was the first winner of the A. C. Langmuir Prize, given by the American Chemical Society to the best young chemist in the nation.  Two years later he was the youngest person, at the time, to be inducted into the National Academy of Sciences.

The volume of awards that he received was so great that, on the surface, some appear to contradict others.  For example, he received the Presidential Medal for Merit in 1948 for the scientific work (including new rocket propellants and explosives) that he conducted on behalf of the Allied effort during World War II.  Thirteen years later, in 1961, he was named Humanist of the Year by the American Humanist Association.

He also received honors from organizations around the world: the Humphry Davy Medal from the Royal Society in 1947, the Amedeo Avogadro Medal from the Italian National Academy in 1956, the Lomonosov Medal from the Soviet Academy in 1978.  And he graciously accepted decorations from slightly lower profile organizations as well, including (our favorite) an honorary black belt from the All Japan Karate-Do Federations in 1980.

Linus Pauling: peace activist and honorary black belt, 1980.

He remains, of course, history’s only recipient of two unshared Nobel Prizes.

The Pauling Awards site was originally released in 2004 as a CONTENTdm collection.  In the years that followed, the talented student staff of the Special Collections & Archives Research Center photographed many more items that did not make it onto the 2004 release and also rephotographed artifacts that weren’t captured in exceptional quality the first time around.

Since 2004 many of our web projects have also moved to a METS and MODS based metadata platform, so the desire to add the new and improved image content to the Awards site dovetailed nicely with the desire to describe increasing percentages of our content in METS records. (We talked a lot about METS and MODS in this series of posts from 2008 and 2009.)

One exciting new technical innovation developed for the Pauling Awards revamp was the automated batch generation of METS records using the XSL scripting language. In the past, all of our METS records have been created by hand. But because the Awards series in the Pauling finding aid is described on the item level, it was possible to develop scripts that pull the item-level data out of the XML files in which the series has been encoded and generate the corresponding METS records in batches. This batch of automated records did require a small amount of clean-up by our resident humans, but the process was hugely time efficient relative to creating each record by hand. Because multiple components of the Pauling finding aid (like the photographs) are described on the item level, a batch process similar to what was developed for the Awards site may come into use again for future digital collections.
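The general shape of such a batch process can be sketched as follows. This is a Python approximation under stated assumptions, not our XSL scripts: the sample EAD uses abbreviated, made-up titles, the mapping of EAD fields to MODS elements is only one plausible choice, and the output is a skeleton rather than a complete, valid METS record (it has no structMap or file section).

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS)
ET.register_namespace("mods", MODS)

# Abbreviated sample of item-level EAD; titles are placeholders.
EAD_SERIES = """
<c02 level="file">
  <c03 level="item">
    <did>
      <container type="folder">1933i.1</container>
      <unittitle>First sample item.</unittitle>
      <unitdate>1933</unitdate>
    </did>
  </c03>
  <c03 level="item">
    <did>
      <container type="folder">1933i.2</container>
      <unittitle>Second sample item.</unittitle>
      <unitdate>1933</unitdate>
    </did>
  </c03>
</c02>
"""

def mets_skeleton(item):
    """Build a bare METS wrapper with MODS title and date for one EAD item."""
    did = item.find("did")
    mets = ET.Element(f"{{{METS}}}mets", {"OBJID": did.findtext("container")})
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", {"ID": "dmd1"})
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    mods = ET.SubElement(xml_data, f"{{{MODS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS}}}title").text = did.findtext("unittitle")
    origin = ET.SubElement(mods, f"{{{MODS}}}originInfo")
    ET.SubElement(origin, f"{{{MODS}}}dateCreated").text = did.findtext("unitdate")
    return mets

def batch_generate(ead_text):
    """Return one METS skeleton per item-level component in the EAD fragment."""
    root = ET.fromstring(ead_text)
    return [mets_skeleton(c) for c in root.iter("c03") if c.get("level") == "item"]

records = batch_generate(EAD_SERIES)
```

The human clean-up step mentioned above would happen after a pass like this, on the generated records.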

The interface for the 2011 version of the Pauling Awards site is also hugely improved over the 2004 version.  As with all of the METS-based websites that we have released over the years, the Awards site was designed using XSL and CSS, a process that allows for maximum flexibility.  As a result, users are now able to navigate the digital collection much more easily than was previously the case.  The item level metadata is better now too, allowing for improved alternative navigation, such as this subject view.

For more on the Pauling Awards site, see this press release, which, among other things, discusses some of the site’s new navigation features in greater depth.

New Image Search and Catalogue Pages

Continuing the theme from our last post, Redesigning our Web Presence, here is a closer look at how we built the new Image Search feature as well as what it takes to create the catalogue pages showing the detailed holdings of our collections.

New Image Search Feature

Our main web search feature, now on all of our pages, is provided by Oregon State’s campus search engine running the Nutch software. This software works similarly to Google, using web crawlers that find web pages and index them for searching. Each Documentary History website, as well as Linus Pauling Day-by-Day, has a search box that limits results to the pages and items on only that site.

However, the main search feature on the OSU Libraries homepage does not use the campus engine, and our websites and digital objects were not included. The Library’s search feature is powered by LibraryFind, built by a team at OSU using the Ruby on Rails web application software. LibraryFind harvests and indexes many records and data sources, but does not crawl web pages. We needed to get our records harvested by and indexed into LibraryFind.

Since our digital object records are stored using the METS format with MODS metadata, it was fairly easy to convert this to Dublin Core metadata, which can then be served up by an OAI-PMH provider. LibraryFind then checks our OAI-PMH provider and harvests our digital object metadata for indexing into LibraryFind. Thanks to Terry Reese for his assistance on setting up the provider software and for his work on LibraryFind.

One piece of metadata that our METS records didn’t have that we needed to add was a URL where the digital object sits once it’s part of our Documentary History websites or Linus Pauling Day-by-Day. Our own scripts and stylesheets didn’t need this information, since they built the pages, but LibraryFind needed the URL to know where to provide a link to from search results that included our digital objects. This required adding:

<mods:location><mods:url access="object in context" displayLabel="Linus Pauling and the Race for DNA: A Documentary History. Pictures and Illustrations.">
http://scarc.library.oregonstate.edu/coll/pauling/dna/pictures/1948i.61.html
</mods:url></mods:location>

as appropriate to our metadata files.
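To show why the URL matters to a harvester, here is a minimal Python sketch of pulling a title and the "object in context" URL out of a MODS record and expressing them under Dublin Core element names. The sample record and its title are invented, and the mapping (URL to identifier) is an assumption for illustration, not a description of how our OAI-PMH provider is actually configured.

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"
NS = {"mods": MODS}

# Invented sample record; only the URL is taken from the example above.
MODS_RECORD = f"""
<mods:mods xmlns:mods="{MODS}">
  <mods:titleInfo><mods:title>Sample title</mods:title></mods:titleInfo>
  <mods:location><mods:url access="object in context" displayLabel="Sample site">
http://scarc.library.oregonstate.edu/coll/pauling/dna/pictures/1948i.61.html
</mods:url></mods:location>
</mods:mods>
"""

def mods_to_dc(mods_text):
    """Return a dict keyed by Dublin Core element names (assumed mapping)."""
    mods = ET.fromstring(mods_text)
    dc = {}
    title = mods.findtext("mods:titleInfo/mods:title", namespaces=NS)
    if title:
        dc["title"] = title.strip()
    url = mods.find("mods:location/mods:url[@access='object in context']", NS)
    if url is not None and url.text:
        # Whitespace around the URL comes from the line breaks in the record.
        dc["identifier"] = url.text.strip()
    return dc

dc_record = mods_to_dc(MODS_RECORD)
```

Without the URL element, a record like this would give a search result nothing to link to.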

Now that our digital objects are harvested into LibraryFind, they are included in results for searches done from the Library’s homepage. We can also limit the search results to just our materials, and by turning on the image results view, we get our new Image Search functionality. Now it is easy to search for digitized materials that are online, including photographs, scans of documents, and more.

Image Search results, powered by LibraryFind

In the future, we plan to include more of our digital objects and increase support for complex items and multimedia files.

Building Catalogue Pages from EAD

Each of our collections is stored in an Encoded Archival Description (EAD) XML file, which includes collection information as well as, for most collections, a detailed catalogue listing. Often these descriptions go down to the item level, and there are varying levels and hierarchies in use, depending on what was appropriate for the material. The EAD files are processed by XSLT stylesheet files, similar to the rest of our website, and ideally we’d like to have a single set of files that can handle the different description levels and that we don’t have to tweak for specific collections. Here is a snippet of EAD XML for the Pauling Photographs series:

<c02 level="file">
<did>
<container type="box">1933i</container>
<unittitle>Photographs and Images related to Linus Pauling, <unitdate>1933</unitdate>.</unittitle>
</did>

<c03 level="item">
<did>
<container type="folder">1933i.1</container>
<unitid audience="internal">1311</unitid>
<unittitle>Linus Pauling at OSU (Oregon Agricultural College) to receive an honorary doctorate of science. Pictured from left are Dr. Marvin Gordon Neale, president of the University of Idaho, David C. Henny, Linus Pauling, Chancellor W. J. Kerr, and Charles A. Howard, state superintendent of public instruction in Oregon. June 5, 1933. "LP at OSU (OAC) honorary doctorate 1933" Photographer unknown. Black and white print courtesy of the Oregon State University Archives.</unittitle>
<unitdate>1933</unitdate>
</did>
</c03>

<c03 level="item">
<did>
<container type="folder">1933i.2</container>
<unittitle>D.C. Henny, C.A. Howard and Linus Pauling. Linus Pauling receiving an honorary degree from Oregon Agricultural College. Print courtesy of the Oregon State University Archives. Photographer unknown. Black and white print.</unittitle>
<unitdate>1933</unitdate>
</did>
</c03>

</c02>

Our previous setup for processing catalogues was heavily modified from older EAD Cookbook stylesheet files and contained a large amount of custom code. One of the major sections we added was code that split up very large sections and box listings into separate web pages that were a reasonable size, instead of presenting hundreds of boxes or folders on each page. To accomplish this, the stylesheet used to first make a temporary file that built a high-level ‘menu’ of the catalogue, which had a range for each page that was to be created. The rest of the stylesheet would then use this menu file to determine what pages to create, and searched over the whole collection for the IDs of the start and end of each page. This was complicated by the fact that some box IDs are not easy to compare numerically, such as ‘1954h3.4’ or ‘NDi.2’. Usually the alphabetic characters were replaced with numbers or punctuation to facilitate comparing them against each other. This technique was not very efficient for large sections, and required lots of tweaking to be able to handle all the various box and folder IDs we have.

Also, separate stylesheet files had to be created to better handle the Pauling Papers since it was so much larger than anything else, which meant that it was a pain to maintain features in both sets of stylesheet files. For the newer set of catalogue XSLT stylesheet files, we took a few different approaches.

First, a lot of redundant code was eliminated through the use of more XSLT matching templates. These allow you to write one set of formatting code and reuse it whenever an element appears. This made it easier to work around EAD’s flexibility for container lists, so a series or box would get processed the same no matter where it was in the encoded hierarchy. Here are some examples of the EAD matching templates:

<xsl:template match="ead:unittitle">
<xsl:apply-templates/>
<xsl:if test="not(ends-with(., '.'))">.</xsl:if>
<xsl:text> </xsl:text>
<xsl:if test="following-sibling::ead:unittitle"><br/></xsl:if>
</xsl:template>

<xsl:template match="ead:unitid[@type]">
<xsl:choose>
<xsl:when test="@type = 'isbn'">
<a class="nowrap" href="http://www.worldcat.org/search?q=isbn%3A{replace(., '-', '')}" title="Look up this title in WorldCat"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'oclc'">
<a class="nowrap" href="http://www.worldcat.org/search?q=no%3A{.}" title="Look up this title in WorldCat"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'lcc'">
<a class="nowrap" href="http://oasis.oregonstate.edu/search/c?SEARCH={replace(., ' ', '+')}" title="Look up this title in the OSU Libraries' catalog"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'gdoc'">
<a class="nowrap" href="http://oasis.oregonstate.edu/search/?searchtype=g&amp;searcharg={replace(replace(., ' ', '+'), '/', '%2F')}" title="Look up this title in the OSU Libraries' catalog"><xsl:value-of select="."/></a></xsl:when>
<xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
</xsl:choose>
</xsl:template>

Second, the code to break long sections into smaller pages for web display was redone, this time using the position of a box or item instead of the IDs. IDs are still used for display purposes, but the output is all based on position (such as 1-10, 11-20, etc.). This code is much cleaner since IDs are no longer directly compared. It’s also faster since it deals with the pages sequentially and doesn’t loop over the whole section every time a page is processed.
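The position-based idea can be sketched in a few lines of Python. This is an illustration only, not our XSLT; the function name and page size are arbitrary, and the IDs are samples in the style of our box and folder IDs.

```python
def paginate(items, page_size=10):
    """Split items into pages by position; IDs are used only for the label."""
    pages = []
    for start in range(0, len(items), page_size):
        chunk = items[start:start + page_size]
        pages.append({
            # 1-based positions of the first and last item on the page
            "positions": (start + 1, start + len(chunk)),
            # Display label built from the IDs, which are never compared
            "label": f"{chunk[0]} to {chunk[-1]}",
            "items": chunk,
        })
    return pages

box_ids = ["1954h3.1", "1954h3.2", "1954h3.3", "1954h3.4", "NDi.1", "NDi.2", "NDi.3"]
pages = paginate(box_ids, page_size=3)
```

Because nothing is sorted or compared, awkward IDs such as "NDi.2" need no special handling; each page is produced in a single pass over the list.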

Example of new catalogue page of Pauling Correspondence

Third, instead of using HTML tables for the columns layout of catalogue pages, we switched to a CSS-based layout that approximates the look of columns and indents. This requires much less code in both the XSLT for processing and the output files.

Finally, all catalogues are processed by the same set of files, and separate ones for the Pauling Papers are no longer needed. This will enable us to make improvements faster, expanding our links to digital content and increasing the access options for our materials.

The Building Blocks of Linus Pauling Day-by-Day, Part 2: XSLT

In our last post we discussed the four XML pieces that form the content of Linus Pauling Day-by-Day. Once properly formatted, these disparate elements are then combined into a cohesive, web-ready whole using a set of rules called Extensible Stylesheet Language Transformations (XSLT). This post will delve more deeply into XSLT and how we use it on the Linus Pauling Day-by-Day site.

Creating the Various HTML Pages

A single XSLT  file controls the majority of the actions and templates that make up Linus Pauling Day-by-Day. Several Windows batch files, one for each year in the Calendar, control the calling of the calendar stylesheet with the calendar year XML. After some initial processing, the stylesheet begins to output all of the various pages. First, the main Calendar Homepage is created. This can be redundant when processing multiple years in a row, but ensures that the statistics and links are always up-to-date. Second, the Year Page is created. This pulls in extra data from the Pauling Chronology TEI file, specific to the year, displays a summary of the Paulings’ travel for that year, provides a snapshot image for the year, and lastly presents the activities for the year that don’t have a more specific date. An additional page is created that provides a larger image and some metadata for the snapshot document.

Then, acting on the months present in the data file, a Month Page is created. This presents the first calendar grid (how this is done is explained in the next section) followed by all of the days in that month that have activities and document data. If there is data about where the Paulings traveled, that is displayed in the calendar grid to give a visual overview of their itineraries. Finally, a Day Page is created for each day present within a month. The stylesheet simply acts on the days listed in the calendar XML file, and does not try to figure out how many days are in a given month in a given year. This style of acting on the data provided, rather than always iterating over a fixed size or range, is characteristic of the functional programming paradigm that XSLT follows. The Day Page features a large thumbnail of a document or photo, a smaller calendar grid, with travel information for the day displayed below if present, and then the activities and documents for that day.

fig. 1 Day Page view

Another page is then created with larger images of the document, additional pages if present, and some metadata about the document. As most of the output pages have a similar look and feel, there is a set of templates that handle the calendar navigation menu. Depending on what context the navigation is needed for (Year Page, Month Page, etc.), the output can adjust accordingly.

Computing the Calendar Grids

An important part of the look and feel of the project from early in the planning stage was to have familiar and user-friendly navigation for the large site, which meant that the traditional calendar grid would play a big role.

fig. 2 Calendar Grid example, including travel

However, the data that we’ve amassed at the day level doesn’t have any information like the day of the week.  Beyond the 7 days of the week that January 1 could fall on, the grids for each month are complicated by whether or not a year is a leap year, resulting in 14 combinations that don’t occur in a regular pattern. If you’ve ever looked at a perpetual calendar, you’ll know that the grids are deceptive in how simple they look.

An initial design goal was to have the calendar stylesheet be able to handle all of the grid variations with only minimal help. We didn’t really want to have to prepare and store lots of information for each month of each year that the project would encompass. For each year, all the stylesheet needs is the day of the week that January 1 falls on (a number 1-7, representing Sunday-Saturday in our case), which is stored in the year XML, and then it can take that and the year and figure out all of the month grids. Fitting the algorithm for this into the XSLT stylesheet code is one of the more complex coding projects we’ve worked on.

It took several hundred lines of code, but we haven’t needed to mess with it since it was first written, even as we’ve added years and expanded the project. With the help of new features in XSLT version 2 and several more years of experience, the code could be rewritten to be cleaner and more efficient. However, because it was so reliable, time on stylesheet development was spent elsewhere.
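For readers curious about the underlying arithmetic, here is a compact Python sketch of the same idea: given only the year and the weekday of January 1 (1-7 for Sunday-Saturday, as described above), every month grid can be derived. This is an illustration of the calculation, not a translation of our stylesheet, and the function names are our own for this example.

```python
def is_leap(year):
    """Gregorian leap-year rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def month_lengths(year):
    return [31, 29 if is_leap(year) else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def month_grids(year, jan1_weekday):
    """Return 12 grids; each grid is a list of 7-cell weeks (None = empty cell).

    jan1_weekday is 1-7, representing Sunday-Saturday.
    """
    grids = []
    first = jan1_weekday - 1  # 0-based column holding the 1st of the month
    for length in month_lengths(year):
        cells = [None] * first + list(range(1, length + 1))
        cells += [None] * (-len(cells) % 7)  # pad the final week
        grids.append([cells[i:i + 7] for i in range(0, len(cells), 7)])
        first = (first + length) % 7  # column of the 1st of the next month
    return grids

# January 1, 1940 fell on a Monday, so jan1_weekday is 2.
grids_1940 = month_grids(1940, 2)
august = grids_1940[7]
```

In this example, `august[0]` is the partial first week of August 1940 and `august[1]` is the first full week, which contains August 7 in the Wednesday column.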

fig. 3 Travel summary display for 1954

Travel Summary Display

In the calendar XML data, each day can contain a basic <travel> element that states where the Paulings were on that day. This information is gleaned mostly from travel itineraries and speaking engagements, but also from correspondence or manuscript notes, and usually represents a city. On the Year page, we wanted to present a nice summary for the year of where the Paulings traveled to, and how long their trips were. Because the data was spread across days, and not already grouped together, it was a challenge at first to get trip totals that were always accurate and grouped appropriately. Using the new grouping features in XSLT version 2, the various trips could be grouped together appropriately, and then displayed in order. A range could be computed using the first and last entry of the group, and linked to the day page for the first entry. Now if you see an interesting place that the Paulings visited, you can go directly to the first day they were there, and see what they were doing that day. If more than one day was spent in a given place, a total is displayed showing how many days they were there.
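The grouping logic can be approximated in Python as shown below. In XSLT 2.0 the natural tool for this is xsl:for-each-group, though this sketch does not claim to mirror our stylesheet. The places and dates are invented sample data, and treating a gap of more than one day as the start of a new trip is an assumption of this example.

```python
from datetime import date

def summarize_trips(days):
    """Collapse (date, place) pairs into trips of consecutive days in one place."""
    trips = []
    for day, place in sorted(days):
        last = trips[-1] if trips else None
        if last and last["place"] == place and (day - last["last"]).days == 1:
            # Same place on the very next day: extend the current trip.
            last["last"] = day
            last["days"] += 1
        else:
            trips.append({"place": place, "first": day, "last": day, "days": 1})
    return trips

# Invented sample data, not taken from the calendar.
sample_days = [
    (date(1954, 3, 1), "City A"),
    (date(1954, 3, 2), "City A"),
    (date(1954, 3, 3), "City B"),
    (date(1954, 3, 10), "City A"),
]
trips = summarize_trips(sample_days)
```

Each trip's "first" date is what a summary line would link to, and "days" supplies the total shown for multi-day stays.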

METS for Documents and Images

The last post covered what METS records are and how we are using them. Because the files are fairly complex and have additional data that we aren’t using, the calendar stylesheet abstracts away the unneeded info and complexity. A temporary data structure is used to store the data needed, and then the calendar stylesheet refers to that in its templates, instead of dealing directly with the METS record and the descriptive MODS metadata. This approach is also used in the stylesheets for our Documentary History websites, and portions of the code were able to be repurposed for the calendar stylesheet.

Transcripts

As covered in the last post, the transcripts are stored in individual TEI Lite XML files. In our calendar year XML data, a simple <transcript> element added to a listing conveys the ID of the transcript file. The stylesheet can then take this ID, retrieve the file, apply the formatting templates, and output the result to the HTML pages. We use XML Namespaces to keep the types of source documents separate, and then the XSLT stylesheet can apply formatting to only a specific one. So, if we wanted to apply some styling to the title element, we could make sure that it was only the title element from a TEI file, not from a METS file. This allows us to have a group of formatting templates for TEI Lite files in a separate XSLT file, which can be imported by the calendar stylesheet, and none of its rules and templates will affect anything but the TEI Lite files. Since the code for the TEI Lite files and transcripts was already written for earlier projects, very little stylesheet code (less than 10 lines) was needed to add the ability to display nice, formatted transcripts.
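The effect of namespaces can be seen in a small Python sketch. The wrapper element and sample text are invented, and the TEI namespace URI used here is the one defined for current TEI; whether our own TEI Lite files declare exactly this namespace is an assumption of the example.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"   # assumed namespace for the TEI files
MODS = "http://www.loc.gov/mods/v3"

# Two elements share the local name "title" but live in different namespaces.
DOC = f"""
<bundle>
  <tei:title xmlns:tei="{TEI}">Transcript title</tei:title>
  <mods:title xmlns:mods="{MODS}">Object title</mods:title>
</bundle>
"""

def tei_titles(xml_text):
    """Return the text of title elements in the TEI namespace only."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{TEI}}}title")]

titles = tei_titles(DOC)
```

The MODS title is never touched, which is the same isolation a template such as match="tei:title" gives a stylesheet.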

The Building Blocks of Linus Pauling Day-by-Day

fig. 1 The technical workflow of Linus Pauling Day-by-Day

Any given page in the Linus Pauling Day-by-Day calendar is the product of up to four different XML records.  These records describe the various bits of data that comprise the project – be they document summaries, images or full-text transcripts.  The data contained in the various XML records are then interpreted by XSL stylesheets, which redistribute the information and generate local HTML files as their output.  The local HTML is, in turn, styled using CSS and then uploaded to the web.

Got all that?

In reality, the process is not quite as complicated as it may seem.  Today’s post is devoted to describing the four XML components that serve as building blocks for the calendar project.  Later on this week, we’ll talk more about the XSL side of things. (For some introductory information on XML and XSL, see this post, which discusses our use of these tools in creating The Pauling Catalogue)

Preliminary work in WordPerfect

The 68,000+ document summaries that comprise the meat of Linus Pauling Day-by-Day have been compiled by dozens of student assistants over the past ten years.  Typically, each student has been assigned a small portion of the collection and is charged with summarizing, in two or three sentences, each document contained in their assigned area.  These summaries have, to date, been written in a series of WordPerfect documents.

The January 30, 2009 launch of Linus Pauling Day-by-Day is being referred to, internally, as Calendar 1.5.  This is in part because of several major workflow changes that we have on tap for future calendar work, a big part of it being a movement out of WordPerfect.  While the word processing approach has worked pretty well for our students – it’s an interface with which they’re familiar, and includes all the usual advantages of a word processing application (spellchecking, etc.) – it does present fairly substantial complications for later stages of the work flow.

For one, everything has to be exported out of the word processing documents and marked up in XML by hand. For another, the movement out of a word processor and into xml often carries with it issues related to special characters, especially “&,” “smart quotes” and “em dash,” all of which can play havoc with certain xml applications.
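One way to tame these characters before the text goes into XML is sketched below in Python. This is a hypothetical clean-up policy for illustration, not our actual workflow: it flattens smart quotes and em dashes to plain ASCII, then escapes the characters that XML reserves.

```python
from xml.sax.saxutils import escape

# Hypothetical policy: replace typographic characters with ASCII equivalents.
REPLACEMENTS = {
    "\u201c": '"',   # left double smart quote
    "\u201d": '"',   # right double smart quote
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote
    "\u2014": "--",  # em dash
}

def clean_for_xml(text):
    """Normalize typographic characters, then escape &, < and > for XML."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return escape(text)

cleaned = clean_for_xml("Pauling & Weaver \u201cnotes\u201d \u2014 1940")
```

An alternative policy would keep the typographic characters and simply make sure the files are saved and declared as UTF-8; the ampersand must be escaped either way.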

Our plan for Calendar 2.0 is to move out of a word processing interface for the initial data entry in favor of an XForms interface, but that’s fodder for a later post.

The “Year XML”

Once a complete set of data has been compiled in WordPerfect, the content is then moved into XML.  All of the event summaries that our students write are contained in what might be called “Year XML” records.  An example of the types of data that are contained in these XML files is illustrated here in fig. 2.  Note that the information in the fig. 2 slide is truncated for display purposes – all of the “—-” markers represent text that has been removed for sake of scaling the illustration – but that generally speaking, the slide refers to the contents of the January 2, 1940 and August 7, 1940 Day-by-Day pages, the latter of which will also serve as our default illustrations reference.

Cursory inspection of the “Year XML” slide reveals one of the mark-up language’s key strengths – its simplicity. For the most part, all of the tags used are easily understandable to humans and the tag hierarchy that organizes the information follows a rather elementary logic. The type of record is identified using <calendar>, months are tagged as <month>, days are tagged as <day> and document summaries are tagged as <event>.
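To make the hierarchy concrete, here is a small Python sketch that walks a record of this shape. The four tag names come from the description above; the attribute names (year, n) and the summaries are invented for the example, since the real attributes are only visible in the slide.

```python
import xml.etree.ElementTree as ET

# Invented sample; attribute names are assumptions, tag names are not.
YEAR_XML = """
<calendar year="1940">
  <month n="1">
    <day n="2">
      <event>Sample summary one.</event>
      <event>Sample summary two.</event>
    </day>
  </month>
  <month n="8">
    <day n="7">
      <event>Sample summary three.</event>
    </day>
  </month>
</calendar>
"""

def events_by_day(xml_text):
    """Map (year, month, day) to the list of event summaries for that day."""
    cal = ET.fromstring(xml_text)
    year = int(cal.get("year"))
    out = {}
    # Only months and days present in the data are visited.
    for month in cal.findall("month"):
        for day in month.findall("day"):
            key = (year, int(month.get("n")), int(day.get("n")))
            out[key] = [e.text for e in day.findall("event")]
    return out

events = events_by_day(YEAR_XML)
```

Days with no recorded activity simply never appear, so nothing downstream has to know how long each month is.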

The one semi-befuddling aspect of the “Year XML” syntax is the i.d. system used in reference to illustrations and transcripts.  After much experimentation, we have developed an internal naming system that works pretty well in assigning unique identifiers to every item in the Pauling archive.  The system is primarily based upon a document’s Pauling Catalogue series location and folder identifier, although since the Catalogue is not an item-level inventory (not completely, anyway) many items require further description in their identifier.  In the most common case of letters, the further description includes identifying the correspondents and including a date.

Fig. 2 provides an example of three identifiers. The first is <record><id series="09">1940i.38</id></record>, which is the “Snapshot” reference for the 1940 index page. This identifier is relatively simple as it defines a photograph contained in the Pauling Catalogue Photographs and Images series (series 09), the entirety of which is described on the item level. So this XML identifier utilizes only a series notation (“09”) and a Pauling Catalogue notation (1940i.38).

The two other examples in Fig. 2 are both letters. The first is <record><id series="01">corr136.9-lp-giguere-19400102</id></record>, a letter from Linus Pauling to Paul Giguere located in the Pauling Catalogue Correspondence series (series 01) in folder 136.9, and used on Day-by-Day as the illustration for the first week of January, 1940. Because the folder is not further described on the item level, there exists a need for more explication in the identifier of this digital object. Hence the listing of the correspondents involved and the date on which the letter was written.

The second example is a similar case: <record><id series="11">sci14.038.9-lp-weaver-19400807</id></record>, used as the Day-by-Day illustration for the first full week of August 1940. In this instance, however, the original document is held in Pauling Catalogue series 11, Science, and is a letter written by Pauling to Warren Weaver on August 7, 1940.
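The three identifiers above can be taken apart mechanically, as this Python sketch suggests. The parsing rule is inferred from these three examples only (a catalogue location, optionally followed by writer, recipient and a YYYYMMDD date, separated by hyphens); it is an assumption for illustration and may not cover every identifier in the archive.

```python
from datetime import datetime

def parse_identifier(identifier):
    """Split an identifier into location, correspondents and date (if any)."""
    parts = identifier.split("-")
    parsed = {"location": parts[0], "correspondents": None, "date": None}
    if len(parts) == 4:
        # Assumed order: writer, recipient, date written as YYYYMMDD
        parsed["correspondents"] = (parts[1], parts[2])
        parsed["date"] = datetime.strptime(parts[3], "%Y%m%d").date()
    return parsed

photo = parse_identifier("1940i.38")
letter = parse_identifier("corr136.9-lp-giguere-19400102")
```

Item-level material such as the photograph needs only its catalogue location, while the letter carries the extra description in the same string.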

METS Records to Power the Illustrations

We’ve talked about METS records a few times in the past, and have defined them as “all-in-one containers for digital objects.”  The Pauling to Weaver illustration mentioned above is a good example of this crucial piece of functionality, in that it is used as a week illustration in the August 1940 component of the Day-by-Day project, and is also a supporting document on the “It’s in the Blood!” documentary history website.  Despite its dual use, the original document was only ever scanned once and described once in METS and MODS.  Once an item is properly encoded in a METS record, it becomes instantly available for repurposing throughout our web presence.

Just about everything that we need to know about a scanned document is contained in its METS record.  In the case of Day-by-Day, we can see how various components of the Pauling to Weaver METS record are extracted to display on two different pages of the project.  Fig. 3 is a screenshot of this page, the “Week Index View” for the August 7, 1940 Day-by-Day page (all of the days for this given week will display the same illustration, but will obviously feature different events and transcripts listings).  Fig. 4 is a screenshot of the “Full Illustration View,” wherein the user has clicked on the Week Index illustration and gained access to both pages of the letter, as well as a more-detailed description of its contents.

Below (fig. 5) is an annotated version of the full METS record for the Pauling to Weaver letter.  As you’ll note once you click on it, fig. 5 is huge, but it’s worth a look in that, among other details, it gives an indication of how different components of the record are distributed to different pages. For example:

  • The Object Title, “Letter from Linus Pauling to Warren Weaver,” which displays in both views.
  • The Object Summary, “Pauling requests that Max Perutz…,” which displays only in the Full Illustration View.
  • The Object Date, used in both views.
  • The Local Object Identifier, sci14.038.9-lp-weaver-19400807, which displays at the bottom of the Full Illustration View.
  • The Page Count, used only in the Week Index View.
  • Crucially, the 400 pixel-width jpeg images, which are stored in one location on our web server (corresponding, again, with Pauling Catalogue series location), but in this example retrieved for display only in the Week Index View.
  • And likewise, the 600 pixel-width jpeg images, which are retrieved for Day-by-Day display in the Full Illustration View, but also used for reference display in the Documentary History projects.

fig. 5 An annotated version of the full METS record for digital object sci14.038.9-lp-weaver-19400807
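The distribution of record components across the two views can be summarized as a simple lookup.  The sketch below is illustrative only; the key names (object_title, jpeg_400px and so on) are hypothetical labels for the components listed above, not element names taken from the METS record.

```python
WEEK_INDEX = "Week Index View"
FULL_ILLUSTRATION = "Full Illustration View"

# Hypothetical labels for the METS components, mapped to the views that display them.
FIELD_USAGE = {
    "object_title": {WEEK_INDEX, FULL_ILLUSTRATION},
    "object_summary": {FULL_ILLUSTRATION},
    "object_date": {WEEK_INDEX, FULL_ILLUSTRATION},
    "local_identifier": {FULL_ILLUSTRATION},
    "page_count": {WEEK_INDEX},
    "jpeg_400px": {WEEK_INDEX},
    "jpeg_600px": {FULL_ILLUSTRATION},
}


def fields_for(view):
    """List, alphabetically, the components that a given view pulls from the record."""
    return sorted(name for name, views in FIELD_USAGE.items() if view in views)
```

Calling fields_for(WEEK_INDEX) returns the four components used on the week page; the same single record serves both views.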

An additional word about the illustrations used in Linus Pauling Day-by-Day

One of the major new components of the “1.5 Calendar” launch is full-page support for illustrations of ten pages or less – in the 1.0 version of the project, only the first page of each illustration was displayed, no matter the length of the original document.  Obviously this is a huge upgrade in the amount and quality of the content that we are able to provide from within the calendar.  The question begs to be asked, however, “why ten pages or less?”

In truth, the ten pages rule is somewhat arbitrary, but it works pretty well in coping with a major scaling problem that we face with the Day-by-Day project.  Users will note that the “Full Illustration View” for all Day-by-Day objects presents the full page content (when available) on a single html page, as opposed to the cleaner interface used on our Documentary History sites.  There’s a good reason for this.  In the instance of the Documentary History interface, essentially two html pages are generated for every original page of a document used as an illustration: a reference view and a large view.  This approach works well for the Documentary History application, in that even very large objects, such as Pauling’s 199-page long Berkeley Lectures manuscript, can be placed on the web without the size of a project exploding out of control – the Berkeley Lectures comprise 398 html pages, which is a lot, but still doable.

Linus Pauling Day-by-Day, on the other hand, currently requires that the full set of images comprising an illustration be generated separately for each day of the week that the illustration represents.  In other words, if the Berkeley Lectures were chosen to illustrate a week within the calendar, and the full content of the digital object were to be displayed for each day of that week using the same clean interface as a Documentary History, a sum total of 2,786 (199 x 2 x 7) html pages would need to be generated to accomplish the mission.  For that week only.  Obviously this is not a sustainable proposition.  By contrast, the current version 1.5 approach always requires 7 html pages for each week, though full image support and super-clean display are sometimes sacrificed in the process.
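The arithmetic behind the scaling problem is easy to check.  The function names below are made up for this illustration; the figures (two views per page, seven days per week, a 199-page manuscript) come from the discussion above.

```python
def documentary_history_pages(page_count, views_per_page=2):
    """One reference view and one large view for every page of the original."""
    return page_count * views_per_page


def day_by_day_pages_if_paged(page_count, days_per_week=7, views_per_page=2):
    """The same two-view interface, repeated for every day of the illustrated week."""
    return page_count * views_per_page * days_per_week


def day_by_day_pages_v15(days_per_week=7):
    """Version 1.5: one html page per day, regardless of document length."""
    return days_per_week


berkeley_lectures = 199
documentary_total = documentary_history_pages(berkeley_lectures)
paged_total = day_by_day_pages_if_paged(berkeley_lectures)
current_total = day_by_day_pages_v15()
```

For the Berkeley Lectures these work out to 398, 2,786 and 7 html pages respectively.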

Calendar 2.0 will deal with the issue using a database approach, but again, this is a different topic for a different time.

Last but not least, TEI Lite

We’ve discussed TEI Lite in the past as well and will not spend a great deal of time with it here, except to reiterate that it is a simple mark-up language that works well in styling full-text transcripts and other similar documents for the web.

There are nearly 2,000 TEI Lite documents included in Linus Pauling Day-by-Day, virtually all of them transcripts of letters sent or received by Linus Pauling.  Transcript references within the Year XML are illustrated in fig. 2 above – they follow the exact same naming convention as our METS records, except that the mets.xml suffix is replaced by tei.xml.  It is worth noting that rough drafts of most of the text that was eventually encoded in TEI for the Day-by-Day project were generated using OCR software.  And while OCR has improved mightily over the years, it still has its quirks, which is why some of you might find, for example, the numeral 1 standing in for the lower-case letter l in a few of the transcripts currently online.  (We’re working on it.)
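Both points lend themselves to a short sketch.  The hyphen joining the identifier to the suffix in the sample filename is an assumption (only the mets.xml and tei.xml suffixes are stated above), and the 1-for-l check is a rough heuristic offered for illustration, not a description of how the transcripts are actually being corrected.

```python
import re


def tei_name_for(mets_filename):
    """Swap the mets.xml suffix for tei.xml, keeping the shared identifier."""
    suffix = "mets.xml"
    if not mets_filename.endswith(suffix):
        raise ValueError("expected a name ending in mets.xml")
    return mets_filename[: -len(suffix)] + "tei.xml"


# Heuristic: a word made only of letters and the digit 1, containing at least one of each.
# Pure numbers such as 1940 are ignored.
SUSPECT_WORD = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*1)[A-Za-z1]+\b")


def suspect_words(text):
    """Return words in which the digit 1 may be standing in for the letter l."""
    return SUSPECT_WORD.findall(text)


# Hypothetical filename and an invented sample sentence, for illustration only.
transcript_name = tei_name_for("sci14.038.9-lp-weaver-19400807-mets.xml")
flagged = suspect_words("We wi11 write again in 1940.")
```

A check like this would also flag legitimate strings such as "1st", so its output is a list for a human to review rather than a list of certain errors.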

The TEI Lite mark-up for the Pauling to Weaver letter is illustrated in fig. 6, as is, in fig. 7, the encoding for the biographical chronology (written by Dr. Robert Paradowski) used on the 1940 index page.  Note, in particular, the use of <div> tags to declare a year’s worth of information in the Paradowski mark-up.  These tags were included as markers for the xsl stylesheet to pull out a given chunk of data to be placed on a given year’s index page.  The entire Paradowski chronology will be going online soon, and once again that project, as with Day-by-Day, will be generated from only this single record.
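The idea of using <div> tags as markers for pulling out one year’s worth of text can be sketched as follows.  The site itself does this with an xsl stylesheet; the Python below is only a stand-in, and the type and n attribute values, along with the placeholder paragraphs, are assumptions rather than the actual Paradowski encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one <div> per year, identified by assumed type/n attributes.
SAMPLE_CHRONOLOGY = """<text><body>
  <div type="year" n="1939"><p>Placeholder entry for 1939.</p></div>
  <div type="year" n="1940"><p>Placeholder entry for 1940.</p></div>
</body></text>"""


def paragraphs_for_year(tei_xml, year):
    """Return the paragraph text inside the <div> that is marked for the given year."""
    root = ET.fromstring(tei_xml)
    for div in root.iter("div"):
        if div.get("type") == "year" and div.get("n") == str(year):
            return [paragraph.text for paragraph in div.findall("p")]
    return []


entries_1940 = paragraphs_for_year(SAMPLE_CHRONOLOGY, 1940)
```

Because each year is wrapped in its own <div>, a single chronology file can feed every year’s index page.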

fig. 6 The TEI Lite mark-up for Pauling's August 7, 1940 letter to Warren Weaver.

fig. 7 The TEI Lite mark-up for one year of the Paradowski chronology

Custom XML, METS records and TEI Lite – these are the building blocks of Linus Pauling Day-by-Day.  Check back later this week when we’ll discuss the means by which the blocks are assembled into a finished website using XSL stylesheets.

The Martha Chase Effect

Martha Chase and Alfred Hershey, 1953.

The Phenomenon

It pretty well goes without saying that the primary mission of the Oregon State University Libraries Special Collections is to preserve, describe and make available the Ava Helen and Linus Pauling Papers.  Beginning, more or less, with the Pauling centenary in 2001, the main focus of our Pauling-related work has been description and accessibility via the web.  In so doing, we have scanned over one terabyte of data and created, at minimum, tens of thousands of static html pages devoted to the life, work and legacy of Linus Pauling and, to a lesser extent, Ava Helen Pauling.

Knowing this, one might reasonably assume that the top search engine query channeling into the content that we have created would be “Linus Pauling,” or some variant thereof.  A reasonable assumption indeed but, as it turns out, quite wrong.  In 2008, as in 2007 and 2006 (a close second in 2005), the top keyword query for those who found our content through search was…”Martha Chase.”

Martha Chase was a geneticist who, in collaboration with Alfred Hershey, made an important contribution to the DNA story as it played out in the early 1950s.  Prior to Chase and Hershey’s work, scientists were mixed on the question as to what, exactly, was the genetic material.  Many researchers, Pauling included, initially felt that the stuff of heredity was contained in proteins.  Others, of course, eventually theorized that DNA was the source of genetic information.  Using an ordinary blender as their primary tool, Hershey and Chase devised a famous experiment which proved conclusively that DNA did, in fact, carry the genetic code.

Diagram of the Hershey-Chase Blender Experiment.  Image by Eric Arnold.

The breadth of Chase-related content that we have digitized is infinitesimally small relative to the “reams” devoted to Pauling: this page and this page are pretty much it.  And yet, in the context of search, Martha Chase is the top draw to our resources.  It would seem, then, that in the marketplace for information (at least that which is retrieved by search) supply and demand for Martha Chase approach their equilibrium at the two pages devoted to her work on our “Linus Pauling and the Race for DNA” site.

Looking through the web statistics, the phenomenon is remarkably consistent.  Not only has “Martha Chase” been the top search query for our domain over, essentially, the past four years, it was also the top search query for our domain over the final week of 2008.  Indeed, the trend has strengthened to the point where today, those who conduct the simple “Linus Pauling” search in Google will note “martha chase” as a recommended search related to Pauling, though in reality the two had little or no interaction at all.

Learning from the Chase Effect

Looking forward, the Chase Effect has become something that we’re thinking more and more about as we begin to develop new projects for the web.  Our top objective will always be to document Pauling’s impact on any number of fields, but in so doing there likely exists a great deal of opportunity for serving different user groups from what might be called “Chaseian” corners of the web.

To use the old many-fish-in-the-sea analogy, there is a lot of content related to Pauling on the Internet, and though we are the primary contributor to this content, we do compete for pageviews with scads of other extremely diverse projects.  (Take a look at the results for the simple “Linus Pauling” Google search to see how diverse the content providers really are.)  So it’s pretty clear that the Pauling sea is quite large and filled with all manner of creatures.

By comparison, Martha Chase represents a much smaller body of water and, in particular, image searches for her — which is probably where the lion’s share of our successful Chase referrals come from — are dominated by the osulibrary.oregonstate.edu/specialcollections domain.

The idea for future work is to think of the Pauling Papers as a collection of collections in attempting to uncover more Martha Chases.

To an extent we have already, somewhat unwittingly, done this with certain of the Key Participants highlighted on our various documentary history websites.  The Harvey Itano Key Participants page, for example, is the second result returned by Google for “Harvey Itano” searches.  Erwin Chargaff’s page is seventh, Arnold Sommerfeld’s page is eighth and Edward Condon’s is tenth, to name a few more examples.  In each instance, by developing mini-portals related to specific colleagues important to Pauling’s work, we have created resources that help meet the information demand of a non-Pauling user base.

As we standardize our metadata platforms — upgrading older projects and maintaining the standard for new — and, in the process, increase our capacity to “remix” our digital objects, the idea of enhancing existing mini-portals and creating new ones will emerge as an important consideration for our digitization workflow.  This is something that we’ll be talking a lot more about in the months to come.

2008: The Year in Pauling

Linus Pauling at his Deer Flat Ranch home, near Big Sur, California. 1987.

Notable Projects and Events

This has been a terrifically-productive year for the Oregon State University Libraries Special Collections:

Behind the Numbers

The various websites that we have launched over the years continue to attract a fairly large volume of traffic.  Over the past twelve months, our web domain has been the focus of 11.93 million pageviews. (A pageview being officially defined as “A request to the web server by a visitor’s browser for any web page; this excludes images, javascript, and other generally embedded file types.”)  This total is a marked decrease from the 2007 measurement of 14.7 million pageviews.  However, our new releases this year were more of a niche variety, whereas 2007 marked the launch of “Linus Pauling and the International Peace Movement: A Documentary History,” as well as two additional new years of Day-by-Day content.  The difference in these types of projects helps explain the downturn.

The largest source of 2008 traffic (4.48 million pageviews) is an oldie but a goody – Linus Pauling Research Notebooks.  Originally released in 2002 and consisting of well over 15,000 html files, this cross-indexed digital version of Pauling’s 46 research notebooks has, by our count, generated roughly 39.5 million pageviews over the course of its existence.  The research notebooks site is also the only one of our many Pauling-centric projects to bubble up into the top 10 of Google’s results for the simple Linus Pauling keyword search.  (Not that we’re complaining, of course…)

Second in popularity is, per usual, the mammoth Linus Pauling Day-by-Day project (3.71 million), which currently provides a daily accounting of Linus and Ava Helen Pauling’s activities for the years 1930-1954.

Our four Documentary History websites jockey back and forth for third through sixth places.  Having received a big update in February, the Bond site is a clear favorite right now, though Blood will probably move up as well, having also been recently revised.  Here’s a look at how the numbers are shaking out for the major projects under the specialcollections/coll/pauling domain.

stats

Check back on Friday for a few thoughts on search and a peek at 2009.

“It’s in the Blood!” A Revised, METS-based Website

Pastel drawing of a hemoglobin molecule by Roger Hayward, 1964.

“It [hemoglobin] is a good substance from the standpoint of a chemist, because of its availability. All you need to do is to catch somebody, introduce a hypodermic needle and draw out a sample of blood. A standard victim of this practice, weighing perhaps 120 pounds (it’s easier to catch them small!) contains in the red corpuscles in his blood one and two-tenths pounds of hemoglobin.”
– Linus Pauling, 1966.

Some reasonably big news to share today. As announced here, we have launched a revised and expanded version of our 2005 release “It’s in the Blood!  A Documentary History of Linus Pauling, Hemoglobin and Sickle Cell Anemia.”

Similar to the revised version of our “Nature of the Chemical Bond” documentary history website, which was launched this past February, the “second edition” of “It’s in the Blood!” contains a ton more content: the final tally runs to 53 new letters, 458 pages of added manuscripts and papers, 18 new pictures and 11 new audio and video clips.  The metadata for all of the site’s content is drastically improved as well — a fact that is most immediately evident on the various Key Participants pages, which have been transformed from rather spartan affairs to content-rich resources like this page devoted to Harvey Itano.

Aside from the self-evident benefits of adding more content to our pages, revising the older documentary histories has also pushed our digitization work further in the direction of a uniform METS-based platform.  We’ll talk a lot more about METS records at a later time, but for now it’s sufficient to define them as all-in-one containers for digital objects.

We use METS (Metadata Encoding and Transmission Standard) and MODS (Metadata Object Description Schema), both of which are flavors of XML, not only to describe a scanned item in a qualitative sense, but also to define how the item displays on a page.

For example, the METS record that “powers” the hemoglobin molecule above includes an internal i.d., the date and creator of the record, the image caption and its copyright data, the creator of the image (Roger Hayward) and any individuals or organizations associated with it. (Linus Pauling, in this case, since the drawing was published in Pauling and Hayward’s The Architecture of Molecules.)  The record also stores the date of the item’s creation (1964), and the genre type of the original document. (We use the Library of Congress’ Basic Genre Terms for Cultural Heritage Materials as our genre authority.  The Hayward item is defined by BGTCHM as an “illustration.”)

The METS record also defines certain display characteristics that are then interpreted by the XSL stylesheets that build our HTML pages.  Again using our hemoglobin molecule as an example, the METS record which defines the object’s output declares that it can be displayed at one of four different sizes.  The 150-pixel width display is used for all images inserted as Narrative sidebar images (Hemoglobin is on this page), as well as all images aggregated onto a given All Documents and Media index page. (Hemoglobin is about 3/4 of the way down the Pictures and Illustrations index.)  A 400-pixel width version will be used in a revised version of our “Linus Pauling Day-by-Day” project, which we hope to launch later this year.  The 600-pixel width “reference images” display like this, and the 900-pixel width big kahunas look like this.
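To make the four-size idea concrete, here is a minimal sketch of how a stylesheet-like process might look up the right image in a METS file section.  The fileSec, fileGrp, file and FLocat elements and the USE attribute are standard METS, but the USE values, file IDs and image filenames below are invented for illustration and are not taken from the actual hemoglobin record.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical METS fragment: one file group per display width.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="150px"><mets:file ID="f150" MIMETYPE="image/jpeg">
      <mets:FLocat LOCTYPE="URL" xlink:href="hemoglobin-150w.jpg"/></mets:file></mets:fileGrp>
    <mets:fileGrp USE="400px"><mets:file ID="f400" MIMETYPE="image/jpeg">
      <mets:FLocat LOCTYPE="URL" xlink:href="hemoglobin-400w.jpg"/></mets:file></mets:fileGrp>
    <mets:fileGrp USE="600px"><mets:file ID="f600" MIMETYPE="image/jpeg">
      <mets:FLocat LOCTYPE="URL" xlink:href="hemoglobin-600w.jpg"/></mets:file></mets:fileGrp>
    <mets:fileGrp USE="900px"><mets:file ID="f900" MIMETYPE="image/jpeg">
      <mets:FLocat LOCTYPE="URL" xlink:href="hemoglobin-900w.jpg"/></mets:file></mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

# Display contexts named in the text, keyed by pixel width.
DISPLAY_CONTEXT = {
    150: "narrative sidebar and index pages",
    400: "Day-by-Day week illustration",
    600: "reference image",
    900: "large image",
}


def image_for_width(mets_xml, width):
    """Return the image location declared for a given pixel width, or None."""
    root = ET.fromstring(mets_xml)
    for group in root.findall("mets:fileSec/mets:fileGrp", NS):
        if group.get("USE") == f"{width}px":
            return group.find("mets:file/mets:FLocat", NS).get(XLINK_HREF)
    return None


reference_image = image_for_width(SAMPLE_METS, 600)
```

The point of the design is that a page template only needs to know which width it wants; the record supplies the rest.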

METS records take a while to create, but the payoff is well worth the effort.  The flexibility that METS provides both within and across projects is of huge importance to us.  When building really big websites and/or multiple websites with subject matter that tends to overlap (e.g., the documentary histories, the Day-by-Day calendar and the Pauling Student Learning Curriculum), it is far more efficient to be able to describe an object once but use it again and again.

Right now, “Linus Pauling and the Race for DNA” is the last of our documentary history sites still requiring METS attentions.  Once it’s revised, we’ll be able to start thinking concretely about providing different types of portal views and search tools for our growing METS cache (well over 3,000 records currently), an eventuality that promises a whole new range of possibilities for our entire digitization workflow.  But that’s a different topic for a different day…

A New TEI Lite Project: The Pauling Student Learning Curriculum

This past Friday we launched a new project about which we’re pretty excited.  As described in this press release, the Pauling Student Learning Curriculum is geared toward advanced high school- and college-age students, and is applicable to the teaching of both history and science.  As the press release also notes, the large amount of illustrative and hyperlinked content included in the website makes this a resource that should be useful to teachers and students anywhere in the world.

The history of this project is a long and interesting one.  The curriculum itself was originally designed nearly ten years ago for use by visiting fellows of the Linus Pauling Institute.  Over time, the content that was developed for the fellows program was repurposed for use by a University Honors College chemistry class that conducts research on the Pauling legacy every Winter term.  For several years we’ve been planning to post the text of the curriculum online, thinking that doing so would assist those chemistry students whose busy schedules preclude their spending an optimum amount of time in the Special Collections reading room.  It eventually dawned on us that the curriculum could actually be expanded into a powerful resource for use by teachers well beyond the Oregon State University campus, and we’ve been developing the project with that goal in mind ever since.

The bulk of the curriculum is devoted to an abbreviated survey of Pauling’s life and work, presented in chronological order, and grouped under the following headings:

Throughout these sections, we’ve linked to any applicable objects that have already been digitized in support of our various Documentary History and Primary Source websites.

The curriculum also includes a series of instructions on “rules for research” in an archive.  We feel that this is especially important given the youth of our target audience, and hope that it will likewise provide a positive introduction to the ins and outs of conducting scholarship with primary sources, an oftentimes intimidating process for researchers at any level.

The website itself is built with TEI Lite, which we’re using more and more in support of small but clean webpages that can be created and released comparatively quickly.   Though we’ve used the TEI (Text Encoding Initiative) Lite standard for numerous transcripts projects in the past, the first of our sites to be built entirely in TEI Lite was the biographical essay “Bernard Malamud: An Instinctive Friendship,” written by Chester Garrison and posted on our Bernard Malamud Papers page last month.  Plans for several additional TEI Lite-based “microsites” are currently in the works.

TEI Lite is a terrific tool in part because it is very simple to use.  In the example of the curriculum, all of the text, images, administrative metadata and much of the formatting that appears on the finished site are encoded in easily-learned and interpreted tags.  (We used XSL to generate the table of contents and to standardize the page formatting — e.g., where the images sit on a page and how the captions render.)  As a result, most of the mark-up required for these projects is at least roughed out by our student staff, which makes for a pretty efficient workflow within the department.

An example of the TEI Lite code for Page 2 of the Pauling Student Learning Curriculum is included after the jump.  We’ll be happy to answer any reader questions in the Comments to this post.

Creating The Pauling Catalogue: Page Design

Example text and image from The Pauling Catalogue scrapbooks series

[Part 7 of 9]

Once the publication’s text had been encoded and its illustrations selected, the next major challenge in creating The Pauling Catalogue was the actual design of the publication, page-by-page and volume-by-volume.  This process was carried out chiefly through the skillful implementation of Adobe’s InDesign software.

Once the raw text of the publication had been marked up in XML, the catalogue’s textual content was ready to import into InDesign.  The result of this import, however, was a large mass of largely-unformatted text.  As much as possible, various characteristics were assigned to groups of text based upon a given group’s location along the XML hierarchy.  In this way, specific sets of data were styled automatically through a pre-determined set of formatting rules specifying font, color and spacing.

The illustration below is an example of the output generated by this process.  The top level of the hierarchy for this series is the series itself, “Biographical.”  The second level of the hierarchy is the subseries, in this case “Personal and Family.”  The third level is a box title, and the fourth level is a folder title.  The illustration depicts the styling characteristics that were assigned to levels three and four of the hierarchy in the Biographical series: all box titles were formatted in red, all folder titles were formatted in black, and each had its own spacing rules.

An example of the styling output in Biographical: Personal and Family
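Styling by position in the hierarchy can be sketched outside of InDesign as well.  The element names, title attributes and placeholder titles below are hypothetical and do not reproduce the catalogue’s actual XML; only the red and black colors for levels three and four come from the description above.

```python
import xml.etree.ElementTree as ET

# Style rules keyed by depth in the hierarchy (1 = series, 2 = subseries).
STYLE_BY_LEVEL = {
    3: {"role": "box title", "color": "red"},
    4: {"role": "folder title", "color": "black"},
}

# Hypothetical mark-up with placeholder box and folder titles.
SAMPLE_SERIES = """<series title="Biographical">
  <subseries title="Personal and Family">
    <box title="Placeholder box title">
      <folder title="Placeholder folder title"/>
    </box>
  </subseries>
</series>"""


def styled_titles(element, level=1):
    """Walk the tree and yield (title, color) for every level that has a style rule."""
    style = STYLE_BY_LEVEL.get(level)
    if style:
        yield element.get("title"), style["color"]
    for child in element:
        yield from styled_titles(child, level + 1)


styled = list(styled_titles(ET.fromstring(SAMPLE_SERIES)))
```

Because the rule depends only on depth, the same small table of styles applies to every box and folder in the series.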

Formatting the publication’s illustrations was a significantly more complex proposition.  During its initial design phase, a placeholder image template was inserted on each page of The Pauling Catalogue.  These templates consisted of three “boxes” meant to hold printing material (one for the illustration, one for the illustration’s catalogue identification number and one for its caption) as well as two additional “boxes” for non-printing design notes: an instructional note for the designer indicating where the image should be located on a given page, and an abbreviated description of the image used to generate the illustration indices published at the back of each volume.  Identifier and caption information for each illustration was imported directly into these image templates from a series of master Excel spreadsheets.  For pages not containing an illustration, the placeholder templates were later removed.

A representation of the multiple "boxes" comprising the image template used to format each of the 1,200+ illustrations incorporated into The Pauling Catalogue.

A great deal of image correction was likewise conducted to remove flaws — dust speckles, for instance — from the selected illustrations.

A particularly extreme example of the image correction often required in the formatting of the illustrations used in The Pauling Catalogue.

A few original graphics were also created for this project, most notably the Pauling Catalogue badge designed for presentation on the cover of each volume.

The Pauling Catalogue badge and the Pauling Papers logo -- both are used as design elements throughout The Pauling Catalogue.

Manual corrections were made to minimize “widows” and “orphans,” and a few additional manual changes were made directly in InDesign to correct small problems that would not be efficient to address in XML or XSL.

The Pauling Catalogue

All design decisions were made with the overarching, two-pronged goal of this project kept in mind: 1) to disseminate scholarly information in a clean and useable manner and 2) to create a product that is aesthetically pleasing, browseable and of interest to a broad audience.  While the primary market for The Pauling Catalogue is presumed to be academic libraries and history departments, we feel that the finished product is likewise at home on the coffee table or living room bookshelf.

The Pauling Catalogue is available for purchase at http://paulingcatalogue.org