New Image Search and Catalogue Pages

Continuing the theme from our last post, Redesigning our Web Presence, here is a closer look at how we built the new Image Search feature as well as what it takes to create the catalogue pages showing the detailed holdings of our collections.

New Image Search Feature

Our main web search feature, now on all of our pages, is provided by Oregon State’s campus search engine, which runs the Nutch software. Like Google, Nutch uses web crawlers to find web pages and index them for searching. Each Documentary History website, as well as Linus Pauling Day-by-Day, has a search box that limits results to the pages and items on that site alone.

However, the main search feature on the OSU Libraries homepage does not use the campus engine, and our websites and digital objects were not included. The Library’s search feature is powered by LibraryFind, built by a team at OSU using the Ruby on Rails web application software. LibraryFind harvests and indexes many records and data sources, but does not crawl web pages. We needed to get our records harvested by and indexed into LibraryFind.

Since our digital object records are stored in the METS format with MODS metadata, it was fairly easy to convert them to Dublin Core metadata, which can then be served up by an OAI-PMH provider. LibraryFind checks our OAI-PMH provider and harvests our digital object metadata for indexing. Thanks to Terry Reese for his assistance in setting up the provider software and for his work on LibraryFind.

One piece of metadata that our METS records lacked was the URL where a digital object lives once it is part of one of our Documentary History websites or Linus Pauling Day-by-Day. Our own scripts and stylesheets didn’t need this information, since they build the pages themselves, but LibraryFind needs the URL so that search results that include our digital objects can link to them. This required adding:

<mods:location><mods:url access="object in context" displayLabel="Linus Pauling and the Race for DNA: A Documentary History. Pictures and Illustrations.">
http://scarc.library.oregonstate.edu/coll/pauling/dna/pictures/1948i.61.html
</mods:url></mods:location>

as appropriate to our metadata files.
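
To make the conversion step more concrete, here is a minimal Python sketch of mapping a MODS fragment to simple Dublin Core. It is not our production code: the function name `mods_to_dc`, the placeholder title, and the choice to map the "object in context" URL to `dc:identifier` are assumptions made for illustration. Only the `mods:location` element is taken from the example above.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Sample record: the title is a placeholder; the location element is the one shown above.
SAMPLE_MODS = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>Placeholder title</mods:title></mods:titleInfo>
  <mods:location><mods:url access="object in context" displayLabel="Linus Pauling and the Race for DNA: A Documentary History. Pictures and Illustrations.">
http://scarc.library.oregonstate.edu/coll/pauling/dna/pictures/1948i.61.html
  </mods:url></mods:location>
</mods:mods>"""


def mods_to_dc(mods_xml):
    """Build an oai_dc element from a MODS record (hypothetical helper)."""
    ns = {"mods": MODS_NS}
    mods = ET.fromstring(mods_xml)
    dc = ET.Element(f"{{{OAI_DC_NS}}}dc")

    # Copy each title across unchanged.
    for title in mods.findall("mods:titleInfo/mods:title", ns):
        ET.SubElement(dc, f"{{{DC_NS}}}title").text = (title.text or "").strip()

    # Assumption: the in-context URL becomes dc:identifier so a harvester can link to it.
    for url in mods.findall("mods:location/mods:url", ns):
        if url.get("access") == "object in context":
            ET.SubElement(dc, f"{{{DC_NS}}}identifier").text = (url.text or "").strip()

    return dc


ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)
dc_record = mods_to_dc(SAMPLE_MODS)
print(ET.tostring(dc_record, encoding="unicode"))
```

The whitespace around the URL is stripped because the MODS example puts the URL on its own line inside the element.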

Now that our digital objects are harvested into LibraryFind, they are included in results for searches done from the Library’s homepage. We can also limit the search results to just our materials, and by turning on the image results view, we get our new Image Search functionality. Now it is easy to search for digitized materials that are online, including photographs, scans of documents, and more.

Image Search results, powered by LibraryFind


In the future, we plan to include more of our digital objects and increase support for complex items and multimedia files.

Building Catalogue Pages from EAD

Each of our collections is described in an Encoded Archival Description (EAD) XML file, which includes collection information as well as, for most collections, a detailed catalogue listing. Often these descriptions go down to the item level, and the levels and hierarchies in use vary depending on what was appropriate for the material. The EAD files are processed by XSLT stylesheet files, like the rest of our website, and ideally we’d like to have a single set of files that can handle the different description levels without being tweaked for specific collections. Here is a snippet of EAD XML for the Pauling Photographs series:

<c02 level="file">
<did>
<container type="box">1933i</container>
<unittitle>Photographs and Images related to Linus Pauling, <unitdate>1933</unitdate>.</unittitle>
</did>

<c03 level="item">
<did>
<container type="folder">1933i.1</container> 
<unitid audience="internal">1311</unitid>
<unittitle>Linus Pauling at OSU (Oregon Agricultural College) to receive an honorary doctorate of science. Pictured from left are Dr. Marvin Gordon Neale, president of the University of Idaho, David C. Henny, Linus Pauling, Chancellor W. J. Kerr, and Charles A. Howard, state superintendent of public instruction in Oregon. June 5, 1933. "LP at OSU (OAC) honorary doctorate 1933" Photographer unknown. Black and white print courtesy of the Oregon State University Archives.</unittitle>
<unitdate>1933</unitdate>
</did>
</c03>

<c03 level="item">
<did>
<container type="folder">1933i.2</container>
<unittitle>D.C. Henny, C.A. Howard and Linus Pauling. Linus Pauling receiving an honorary degree from Oregon Agricultural College. Print courtesy of the Oregon State University Archives. Photographer unknown. Black and white print.</unittitle>
<unitdate>1933</unitdate>
</did>
</c03>

Our previous setup for processing catalogues was heavily modified from older EAD Cookbook stylesheet files and contained a large amount of custom code. One of the major sections we added was code that split up very large sections and box listings into separate web pages of a reasonable size, instead of presenting hundreds of boxes or folders on each page. To accomplish this, the stylesheet first made a temporary file that built a high-level ‘menu’ of the catalogue, with a range for each page that was to be created. The rest of the stylesheet then used this menu file to determine which pages to create, and searched over the whole collection for the IDs at the start and end of each page. This was complicated by the fact that some box IDs are not easy to compare numerically, such as ‘1954h3.4’ or ‘NDi.2’. Usually the alphabetic characters were replaced with numbers or punctuation to make the IDs comparable. This technique was not very efficient for large sections, and required lots of tweaking to handle all the various box and folder IDs we have.
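
To illustrate why comparing these IDs is awkward, here is a small Python sketch. It is not the character replacement scheme the old stylesheets used. Instead it swaps in a common alternative, a sort key that splits each ID into digit and non-digit runs. The function name `id_sort_key` and the sample ID ‘1954h3.10’ are made up for the example.

```python
import re


def id_sort_key(container_id):
    """Split an ID into runs so that numeric parts compare as numbers (hypothetical helper)."""
    key = []
    for digits, other in re.findall(r"(\d+)|(\D+)", container_id, re.ASCII):
        if digits:
            key.append((0, int(digits), ""))   # numeric run, compared by value
        else:
            key.append((1, 0, other.lower()))  # text run, compared alphabetically
    return key


ids = ["1954h3.10", "NDi.2", "1954h3.4", "1933i.2"]

# A plain string sort puts '1954h3.10' before '1954h3.4'.
print(sorted(ids))

# The run-based key puts '1954h3.4' before '1954h3.10'.
print(sorted(ids, key=id_sort_key))
```

Even with a key like this, every new ID pattern has to be checked against the rule, which is the kind of tweaking described above.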

Also, separate stylesheet files had to be created to handle the Pauling Papers, since that collection is so much larger than anything else, and maintaining features in both sets of stylesheet files was a pain. For the newer set of catalogue XSLT stylesheet files, we took a few different approaches.

First, a lot of redundant code was eliminated through the use of more XSLT matching templates. These allow you to write one set of formatting code and reuse it wherever an element appears. This made it easier to work around EAD’s flexibility for container lists, so a series or box is processed the same way no matter where it sits in the encoded hierarchy. Here are some examples of the EAD matching templates:

<xsl:template match="ead:unittitle">
<xsl:apply-templates/>
<xsl:if test="not(ends-with(., '.'))">.</xsl:if>
<xsl:text> </xsl:text>
<xsl:if test="following-sibling::ead:unittitle"><br/></xsl:if>
</xsl:template>

<xsl:template match="ead:unitid[@type]">
<xsl:choose>
<xsl:when test="@type = 'isbn'">
  <a class="nowrap" href="http://www.worldcat.org/search?q=isbn%3A{replace(., '-', '')}" title="Look up this title in WorldCat"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'oclc'">
  <a class="nowrap" href="http://www.worldcat.org/search?q=no%3A{.}" title="Look up this title in WorldCat"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'lcc'">
  <a class="nowrap" href="http://oasis.oregonstate.edu/search/c?SEARCH={replace(., ' ', '+')}" title="Look up this title in the OSU Libraries' catalog"><xsl:value-of select="."/></a></xsl:when>
<xsl:when test="@type = 'gdoc'">
  <a class="nowrap" href="http://oasis.oregonstate.edu/search/?searchtype=g&amp;searcharg={replace(replace(., ' ', '+'), '/', '%2F')}" title="Look up this title in the OSU Libraries' catalog"><xsl:value-of select="."/></a></xsl:when>
<xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
</xsl:choose>
</xsl:template>

Second, the code to break long sections into smaller pages for web display was redone, this time using the position of a box or item instead of the IDs. IDs are still used for display purposes, but the output is all based on position (such as 1-10, 11-20, and so on). This code is much cleaner since IDs are no longer directly compared. It’s also faster since it deals with the pages sequentially and doesn’t loop over the whole section every time a page is processed.
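
The position-based idea can be sketched in a few lines of Python. This is an illustration only, not a translation of our XSLT: the function name `paginate`, the page size of 10, and the generated sample IDs are assumptions.

```python
def paginate(items, page_size=10):
    """Split items into pages by position; IDs are only used for the display label."""
    pages = []
    for start in range(0, len(items), page_size):
        chunk = items[start:start + page_size]
        pages.append({
            "positions": (start + 1, start + len(chunk)),  # 1-based, inclusive
            "label": f"{chunk[0]} to {chunk[-1]}",          # first and last ID on the page
            "items": chunk,
        })
    return pages


# Hypothetical folder IDs in catalogue order.
folder_ids = [f"1933i.{n}" for n in range(1, 24)]

pages = paginate(folder_ids)
for page in pages:
    first, last = page["positions"]
    print(f"Items {first}-{last}: {page['label']}")
```

Each page is cut from the list in a single pass, and no ID is ever compared against another.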

Example of new catalogue page of Pauling Correspondence


Third, instead of using HTML tables for the column layout of catalogue pages, we switched to a CSS-based layout that approximates the look of columns and indents. This requires much less code in both the XSLT for processing and the output files.

Finally, all catalogues are processed by the same set of files, and separate ones for the Pauling Papers are no longer needed. This will enable us to make improvements faster, expanding our links to digital content and increasing the access options for our materials.
