SCARC Has a New Virtual Home

We reported on something similar nearly three years ago and we’re happy to announce again today that the department behind the Ava Helen and Linus Pauling Papers – the Special Collections & Archives Research Center (SCARC) – now has a new and improved web presence. As first revealed yesterday on our sister blog, Speaking of History:

It is with great pleasure that we announce the official launch of our new department website!  Please find it at and be sure to update your bookmarks away from the old University Archives and Special Collections sites, which will no longer be maintained…

(Click here to read the whole blog post, which offers a tour of all that is new and exciting on the site.)

The SCARC website took nearly ten months to complete from its initial conception to today’s reality, traveling through many different versions along the way.

The version 1 schematic of the proposed SCARC website.

The project was initiated out of a need to present the contents of the former Special Collections and University Archives websites in one spot, the result of the two departments having administratively merged in September 2011.  While the task seemed daunting, the process was made much easier by the fact that both departments had committed to Encoded Archival Description (EAD) as the platform on which archival collections would be prepared and published for consumption on the web.

Though the combined department serves as caretaker of 1,034 collections (and counting), this commitment to a uniform descriptive practice, in tandem with XSLT programming skill within the department, now allows for the generation and presentation of more than 900 finding aids in a relatively painless manner.  Some of the finding aids suffer from wonky encoding, but most of the description is currently presented in a clean and user-friendly fashion, and we’re working to clean up the few troublesome collections that remain.

A glimpse at a few collections from the M alphabetical sort.

The interplay between XML and XSLT also allowed us to “tag” each collection with genres and themes, which could then be output as static HTML pages available to users who wish to browse the collections by a specific orientation.  In other words, if one is interested in viewing only collections containing photographs, that option is available. The same is true for collections containing audio, video or books, as well as collections focusing on university history, the history of science, natural resources, multiculturalism or local history.  And from the perspective of content description, the “tagging” process could not have been simpler.  The screenshot below provides a glimpse of the collections.xml file that guides the theme and genre sort, as well as the image and caption that illustrates each collection.  The simplicity of what you see is just another example of the power of XML (at least when placed in the hands of a strong XSLT programmer).
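
To make the tagging concrete, here is a sketch of what a single entry in the collections.xml file might look like. The element and attribute names below are assumptions for illustration only, not the actual SCARC schema:

```xml
<!-- Sketch only: a hypothetical tagged entry in collections.xml.
     Element and attribute names are illustrative assumptions. -->
<collection id="pauling-papers">
  <title>Ava Helen and Linus Pauling Papers</title>
  <themes>
    <theme>history-of-science</theme>
  </themes>
  <genres>
    <genre>photographs</genre>
    <genre>audio</genre>
  </genres>
  <image src="images/pauling.jpg"/>
  <caption>Linus Pauling, ca. 1954.</caption>
</collection>
```

A stylesheet can then select every collection carrying a given tag (e.g. all entries with a `photographs` genre) and emit one static browse page per theme or genre.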

The new SCARC site also made great use of the model presented by Linus Pauling Online in developing a series of portals that provide access to the great swath of digital content generated by Special Collections and University Archives over time. Beginning in the mid-1990s, both units began digitizing in mounting volume to the point where, in 2011, a huge amount of content was available 24/7 to many different types of user groups…if they were able to find it.  The new site organizes all of this digital content into portals defined by our collection development themes, meant to improve ease of access.

Linus Pauling Online

This approach was based on the example of Linus Pauling Online, a sleekly designed landing page for all of the Pauling-related digital content that Special Collections had generated over more than ten years’ worth of dedicated work.  The beauty of Linus Pauling Online, launched in January 2009, is its ability to provide “one stop shopping” for anyone interested in Pauling who happens upon our web space.  Now, a similar arrangement is offered to users seeking out digitized content related to the four primary collection development themes lying at the heart of SCARC’s mission.

The Oregon Multicultural Archives Digital Resources page.

The finalization of the SCARC website marks a major step forward for the department, as it now allows us to present different types of intellectual content in a single space using uniform tools.  In the near term, look for more digitized videos related to the history of Oregon State University or the Pacific Northwest.  Look also for more digitized and encoded manuscripts, as well as deeper description of collections of all kinds.

Within SCARC the Pauling Papers are now one of 1,034 collections.  But fans of Pauling needn’t worry that his life and work will get lost in the shuffle.  Of SCARC’s ten professional staff, two FTE are devoted to history of science collections and plenty of ideas are in the works for new projects emanating out of the Pauling Papers.  Not least of these is the Pauling Blog, which will continue to provide weekly outreach related to history’s only recipient of two unshared Nobel Prizes.

New Image Search and Catalogue Pages

Continuing the theme from our last post, Redesigning our Web Presence, here is a closer look at how we built the new Image Search feature as well as what it takes to create the catalogue pages showing the detailed holdings of our collections.

New Image Search Feature

Our main web search feature, now on all of our pages, is provided by Oregon State’s campus search engine, which runs the Nutch software. Nutch works similarly to Google: its web crawlers find web pages and index them for searching. Each Documentary History website, as well as Linus Pauling Day-by-Day, has a search box that limits results to the pages and items on that site alone.

However, the main search feature on the OSU Libraries homepage does not use the campus engine, and our websites and digital objects were not included in it. The Library’s search feature is powered by LibraryFind, built by a team at OSU using the Ruby on Rails web application framework. LibraryFind harvests and indexes many records and data sources, but it does not crawl web pages, so we needed to get our records harvested by and indexed into LibraryFind.

Since our digital object records are stored using the METS format with MODS metadata, it was fairly easy to convert this to Dublin Core metadata, which can then be served up by an OAI-PMH provider. LibraryFind then checks our OAI-PMH provider and harvests our digital object metadata for indexing into LibraryFind. Thanks to Terry Reese for his assistance on setting up the provider software and for his work on LibraryFind.
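
As a sketch, a converted record served up by the OAI-PMH provider might look like the following. The title and identifier here are hypothetical, though the `oai_dc` and `dc` namespaces are the standard ones:

```xml
<!-- Sketch only: a hypothetical Dublin Core record as exposed
     through OAI-PMH; the title and identifier are made up. -->
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Linus Pauling lecturing at Oregon Agricultural College</dc:title>
  <dc:type>StillImage</dc:type>
  <dc:format>image/jpeg</dc:format>
  <dc:identifier>oai:example.library.oregonstate.edu:sample-id</dc:identifier>
</oai_dc:dc>
```

Because Dublin Core is the baseline metadata format every OAI-PMH provider must support, flattening the richer MODS fields into these fifteen generic elements is what makes the records harvestable by LibraryFind.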

One piece of metadata that our METS records lacked, and that we needed to add, was a URL for where each digital object sits once it is part of our Documentary History websites or Linus Pauling Day-by-Day. Our own scripts and stylesheets didn’t need this information, since they built the pages themselves, but LibraryFind needed the URL in order to link to our digital objects from its search results. This required adding:

<mods:location>
  <mods:url access="object in context" displayLabel="Linus Pauling and the Race for DNA: A Documentary History. Pictures and Illustrations."><!-- URL of the object's page on the site --></mods:url>
</mods:location>

as appropriate to our metadata files.

Now that our digital objects are harvested into LibraryFind, they are included in results for searches done from the Library’s homepage. We can also limit the search results to just our materials, and by turning on the image results view, we get our new Image Search functionality. Now it is easy to search for digitized materials that are online, including photographs, scans of documents, and more.

Image Search results, powered by LibraryFind


In the future, we plan to include more of our digital objects and increase support for complex items and multimedia files.

Building Catalogue Pages from EAD

Each of our collections is stored in an Encoded Archival Description (EAD) XML file, which includes collection-level information as well as detailed catalogue listings for most of our collections. Often these descriptions go down to the item level, and varying levels and hierarchies are in use, depending on what was appropriate for the material. The EAD files are processed by XSLT stylesheets, like the rest of our website, and ideally we’d like a single set of files that can handle the different description levels without tweaking for specific collections. Here is a snippet of EAD XML for the Pauling Photographs series:

<c02 level="file">
  <container type="box">1933i</container>
  <unittitle>Photographs and Images related to Linus Pauling, <unitdate>1933</unitdate>.</unittitle>

  <c03 level="item">
    <container type="folder">1933i.1</container>
    <unitid audience="internal">1311</unitid>
    <unittitle>Linus Pauling at OSU (Oregon Agricultural College) to receive an honorary doctorate of science. Pictured from left are Dr. Marvin Gordon Neale, president of the University of Idaho, David C. Henny, Linus Pauling, Chancellor W. J. Kerr, and Charles A. Howard, state superintendent of public instruction in Oregon. June 5, 1933. "LP at OSU (OAC) honorary doctorate 1933" Photographer unknown. Black and white print courtesy of the Oregon State University Archives.</unittitle>
  </c03>

  <c03 level="item">
    <container type="folder">1933i.2</container>
    <unittitle>D.C. Henny, C.A. Howard and Linus Pauling. Linus Pauling receiving an honorary degree from Oregon Agricultural College. Print courtesy of the Oregon State University Archives. Photographer unknown. Black and white print.</unittitle>
  </c03>
</c02>

Our previous setup for processing catalogues was heavily modified from older EAD Cookbook stylesheets and contained a large amount of custom code. One of the major sections we added split very large sections and box listings into separate web pages of a reasonable size, instead of presenting hundreds of boxes or folders on each page. To accomplish this, the stylesheet first built a temporary file containing a high-level ‘menu’ of the catalogue, with a range for each page to be created. The rest of the stylesheet then used this menu file to determine what pages to create, searching over the whole collection for the IDs marking the start and end of each page. This was complicated by the fact that some box IDs, such as ‘1954h3.4’ or ‘NDi.2’, are not easy to compare numerically; usually the alphabetic characters were replaced with numbers or punctuation so that IDs could be compared against one another. This technique was not very efficient for large sections, and required lots of tweaking to handle all the various box and folder IDs we have.
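
As a rough sketch of that normalization step (the exact character mapping we used is an assumption here), the alphabetic characters in a box ID could be flattened with XSLT’s translate() function so two IDs compare as plain digit-and-punctuation strings:

```xml
<!-- Sketch only: flatten lowercase letters in a box ID to a dot so
     that IDs like '1954h3.4' can be compared as '1954.3.4'. -->
<xsl:variable name="normalized-id"
    select="translate($box-id,
                      'abcdefghijklmnopqrstuvwxyz',
                      '..........................')"/>
<!-- Uppercase IDs like 'NDi.2' would need a separate mapping,
     one of the wrinkles that made the old approach fragile. -->
```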

Also, separate stylesheet files had to be created to better handle the Pauling Papers since it was so much larger than anything else, which meant that it was a pain to maintain features in both sets of stylesheet files. For the newer set of catalogue XSLT stylesheet files, we took a few different approaches.

First, a lot of redundant code was eliminated through the use of more XSLT matching templates. These allow you to write one set of formatting code and reuse it whenever an element appears. This made it easier to work around EAD’s flexibility for container lists, so a series or box would get processed the same no matter where it was in the encoded hierarchy. Here are some examples of the EAD matching templates:

<xsl:template match="ead:unittitle">
  <!-- output the title text, then append a period if one is missing -->
  <xsl:apply-templates/>
  <xsl:if test="not(ends-with(., '.'))">.</xsl:if>
  <xsl:text> </xsl:text>
  <xsl:if test="following-sibling::ead:unittitle"><br/></xsl:if>
</xsl:template>

<xsl:template match="ead:unitid[@type]">
  <xsl:choose>
    <xsl:when test="@type = 'isbn'">
      <a class="nowrap" href="{replace(., '-', '')}" title="Look up this title in WorldCat"><xsl:value-of select="."/></a>
    </xsl:when>
    <xsl:when test="@type = 'oclc'">
      <a class="nowrap" href="{.}" title="Look up this title in WorldCat"><xsl:value-of select="."/></a>
    </xsl:when>
    <xsl:when test="@type = 'lcc'">
      <a class="nowrap" href="{replace(., ' ', '+')}" title="Look up this title in the OSU Libraries' catalog"><xsl:value-of select="."/></a>
    </xsl:when>
    <xsl:when test="@type = 'gdoc'">
      <a class="nowrap" href=";searcharg={replace(replace(., ' ', '+'), '/', '%2F')}" title="Look up this title in the OSU Libraries' catalog"><xsl:value-of select="."/></a>
    </xsl:when>
    <xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
  </xsl:choose>
</xsl:template>

Second, the code to break long sections into smaller pages for web display was redone, this time using the position of a box or item instead of the IDs. IDs are still used for display purposes, but the output is all based on position (such as 1-10, 11-20, and so on). This code is much cleaner since IDs are no longer directly compared. It’s also faster since it deals with the pages sequentially and doesn’t loop over the whole section every time a page is processed.
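
A minimal sketch of the position-based idea, using XSLT 2.0 grouping (the page size of ten and the element selection here are assumptions, not our actual code):

```xml
<!-- Sketch only: group boxes into fixed-size pages by position.
     (position() - 1) idiv 10 yields 0 for items 1-10, 1 for 11-20, etc. -->
<xsl:for-each-group select="ead:c02" group-adjacent="(position() - 1) idiv 10">
  <!-- current-grouping-key() is the zero-based page number -->
  <xsl:result-document href="page-{current-grouping-key() + 1}.html">
    <xsl:apply-templates select="current-group()"/>
  </xsl:result-document>
</xsl:for-each-group>
```

Because each group is formed from adjacent items in document order, the processor makes a single sequential pass, which is what removes the old need to rescan the whole section for every page.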

Example of new catalogue page of Pauling Correspondence


Third, instead of using HTML tables for the columns layout of catalogue pages, we switched to a CSS-based layout that approximates the look of columns and indents. This requires much less code in both the XSLT for processing and the output files.
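
A minimal sketch of that layout, with hypothetical class names rather than our actual stylesheet:

```html
<!-- Sketch only: floated columns approximating a box/title table row.
     Class names and measurements are illustrative assumptions. -->
<style>
  .row       { overflow: hidden; }
  .box-col   { float: left; width: 8em; }
  .title-col { margin-left: 9em; }
  .indent-1  { padding-left: 2em; }
</style>
<div class="row">
  <div class="box-col">Box 1933i</div>
  <div class="title-col">Photographs and Images related to Linus Pauling, 1933.</div>
</div>
```

One row of markup replaces a `<tr>` full of nested `<td>` elements, which is where the savings in both XSLT and output size come from.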

Finally, all catalogues are processed by the same set of files, and separate ones for the Pauling Papers are no longer needed. This will enable us to make improvements faster, expanding our links to digital content and increasing the access options for our materials.