An Expanded and Improved Pauling Awards Site

Pauling receiving the Priestley Medal, 1984.

It is with great pleasure that we announce the release of a revised and expanded version of the website Linus Pauling: Honors, Awards and Medals. This new iteration of the site includes well over 600 images of nearly all of the 460 awards that Pauling received over the course of his 70+ years in science (as well as the nine awards that he was given after he died).

Indeed, Pauling was a well-decorated individual, the recipient of 47 honorary doctorates and just about every important award that a scientist can get.  He started early: in 1931 he was the first winner of the A. C. Langmuir Prize, given by the American Chemical Society to the best young chemist in the nation.  Two years later he was the youngest person, at the time, to be inducted into the National Academy of Sciences.

The volume of awards that he received was so great that, on the surface, some appear to contradict others.  For example, he received the Presidential Medal for Merit in 1948 for the scientific work (including new rocket propellants and explosives) that he conducted on behalf of the Allied effort during World War II.  Thirteen years later, in 1961, he was named Humanist of the Year by the American Humanist Association.

He also received honors from organizations around the world: the Humphry Davy Medal from the Royal Society in 1947, the Amedeo Avogadro Medal from the Italian National Academy in 1956, and the Lomonosov Medal from the Soviet Academy in 1978. And he graciously accepted decorations from slightly lower profile organizations as well, including (our favorite) an honorary black belt from the All Japan Karate-Do Federations in 1980.

Linus Pauling: peace activist and honorary black belt, 1980.

He remains, of course, history’s only recipient of two unshared Nobel Prizes.

The Pauling Awards site was originally released in 2004 as a CONTENTdm collection.  In the years that followed, the talented student staff of the Special Collections & Archives Research Center photographed many more items that did not make it onto the 2004 release and also rephotographed artifacts that weren’t captured in exceptional quality the first time around.

Since 2004 many of our web projects have also moved to a METS- and MODS-based metadata platform, so the desire to add the new and improved image content to the Awards site dovetailed nicely with the desire to describe increasing percentages of our content in METS records. (We talked a lot about METS and MODS in this series of posts from 2008 and 2009.)

One exciting technical innovation developed for the Pauling Awards revamp was the automated batch generation of METS records using XSL stylesheets. In the past, all of our METS records were created by hand. But because the Awards series in the Pauling finding aid is described on the item level, it was possible to develop scripts that pull the item-level data out of the XML files in which the series has been encoded and generate a METS record for each item automatically. This batch of automated records did require a small amount of clean-up by our resident humans, but the process was hugely time efficient relative to creating each record by hand. Because multiple components of the Pauling finding aid (like the photographs) are described on the item level, a batch process similar to what was developed for the Awards site may come into use again for future digital collections.
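To give a feel for the batch idea, here is a minimal sketch. It is written in Python rather than XSL, and it is not the script described above: the finding-aid element names (item, id, title, date) and the two identifiers are invented for illustration, and the METS output is only a skeleton (a complete record would also need a structural map, file section and so on).

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"

# Hypothetical item-level finding-aid fragment; the real encoding will differ,
# and the <id> values here are made up for the example.
FINDING_AID = """
<series>
  <item><id>1931h.1</id><title>A. C. Langmuir Prize</title><date>1931</date></item>
  <item><id>1984h.2</id><title>Priestley Medal</title><date>1984</date></item>
</series>
"""

def build_mets(item):
    """Return a skeletal METS element wrapping MODS metadata for one item."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": item.findtext("id")})
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": "dmd1"})
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    mods = ET.SubElement(xml_data, f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = item.findtext("title")
    origin = ET.SubElement(mods, f"{{{MODS_NS}}}originInfo")
    ET.SubElement(origin, f"{{{MODS_NS}}}dateCreated").text = item.findtext("date")
    return mets

def batch_generate(finding_aid_xml):
    """Build one METS skeleton per <item> found in the finding aid."""
    root = ET.fromstring(finding_aid_xml)
    return [build_mets(item) for item in root.iter("item")]

records = batch_generate(FINDING_AID)
```

Each generated record could then be serialized with ET.tostring and handed to a human for the clean-up pass mentioned above.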

The interface for the 2011 version of the Pauling Awards site is also hugely improved over the 2004 version.  As with all of the METS-based websites that we have released over the years, the Awards site was designed using XSL and CSS, a process that allows for maximum flexibility.  As a result, users are now able to navigate the digital collection much more easily than was previously the case.  The item level metadata is better now too, allowing for improved alternative navigation, such as this subject view.

For more on the Pauling Awards site, see this press release, which, among other things, discusses some of the site’s new navigation features in greater depth.

The Building Blocks of Linus Pauling Day-by-Day, Part 2: XSLT

In our last post we discussed the four XML pieces that form the content of Linus Pauling Day-by-Day. Once properly formatted, these disparate elements are then combined into a cohesive, web-ready whole using a set of rules called Extensible Stylesheet Language Transformations (XSLT). This post will delve more deeply into XSLT and how we use it on the Linus Pauling Day-by-Day site.

Creating the Various HTML Pages

A single XSLT file controls the majority of the actions and templates that make up Linus Pauling Day-by-Day. Several Windows batch files, one for each year in the Calendar, call the calendar stylesheet with that year’s calendar XML. After some initial processing, the stylesheet begins to output all of the various pages. First, the main Calendar Homepage is created. This can be redundant when processing multiple years in a row, but ensures that the statistics and links are always up-to-date. Second, the Year Page is created. This pulls in extra data specific to the year from the Pauling Chronology TEI file, displays a summary of the Paulings’ travel for that year, provides a snapshot image for the year, and lastly presents the activities for the year that don’t have a more specific date. An additional page is created that provides a larger image and some metadata for the snapshot document.

Then, acting on the months present in the data file, a Month Page is created. This presents the first calendar grid (how this is done is explained in the next section) followed by all of the days in that month that have activities and document data. If there is data about where the Paulings traveled, that is displayed in the calendar grid to give a visual overview of their itineraries. Finally, a Day Page is created for each day present within a month. The stylesheet simply acts on the days listed in the calendar XML file, and does not try to figure out how many days are in a given month in a given year. This style of acting on the data provided, instead of always looping over a fixed size or range, is characteristic of the functional programming paradigm. The Day Page features a large thumbnail of a document or photo, a smaller calendar grid, with travel information for the day displayed below if present, and then the activities and documents for that day.

fig. 1 Day Page view

Another page is then created with larger images of the document, additional pages if present, and some metadata about the document. As most of the output pages have a similar look and feel, there is a set of templates that handle the calendar navigation menu. Depending on what context the navigation is needed for (Year Page, Month Page, etc.), the output can adjust accordingly.
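The data-driven flow described above can be sketched as follows. This is a Python illustration rather than the XSLT stylesheet itself; the element names follow the Year XML discussed elsewhere on this blog, but the year and n attributes, the placeholder summaries and the output file names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Placeholder Year XML: only months and days that actually carry data appear.
# The "year"/"n" attributes and the summaries are hypothetical.
YEAR_XML = """
<calendar year="1940">
  <month n="1">
    <day n="2"><event>(summary)</event></day>
  </month>
  <month n="8">
    <day n="7"><event>(summary)</event></day>
    <day n="9"><event>(summary)</event></day>
  </month>
</calendar>
"""

def pages_to_build(year_xml):
    """List the output pages, acting only on the months and days in the data."""
    calendar = ET.fromstring(year_xml)
    year = calendar.get("year")
    pages = ["index.html", f"{year}/index.html"]  # homepage, then Year Page
    for month in calendar.findall("month"):
        m = int(month.get("n"))
        pages.append(f"{year}/{m:02d}/index.html")  # Month Page
        for day in month.findall("day"):
            pages.append(f"{year}/{m:02d}/{int(day.get('n')):02d}.html")  # Day Page
    return pages

pages = pages_to_build(YEAR_XML)
```

Note that nothing in the loop knows how long a month is; a day page exists only because a day element exists.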

Computing the Calendar Grids

An important part of the look and feel of the project from early in the planning stage was to have familiar and user-friendly navigation for the large site, which meant that the traditional calendar grid would play a big role.

fig. 2 Calendar Grid example, including travel

However, the data that we’ve amassed at the day level doesn’t have any information like the day of the week.  Beyond the 7 days of the week that January 1 could fall on, the grids for each month are complicated by whether or not a year is a leap year, resulting in 14 combinations that don’t occur in a regular pattern. If you’ve ever looked at a perpetual calendar, you’ll know that the grids are deceptive in how simple they look.

An initial design goal was to have the calendar stylesheet be able to handle all of the grid variations with only minimal help. We didn’t really want to have to prepare and store lots of information for each month of each year that the project would encompass. For each year, all the stylesheet needs is the day of the week that January 1 falls on (a number 1-7, representing Sunday-Saturday in our case), which is stored in the year XML, and then it can take that and the year and figure out all of the month grids. Fitting the algorithm for this into the XSLT stylesheet code is one of the more complex coding projects we’ve worked on.

It took several hundred lines of code, but we haven’t needed to mess with it since it was first written, even as we’ve added years and expanded the project. With the help of new features in XSLT version 2 and several more years of experience, the code could be rewritten to be cleaner and more efficient. However, because it was so reliable, time on stylesheet development was spent elsewhere.
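The core of the grid calculation is compact when written in a general-purpose language. The sketch below is Python, not the several hundred lines of XSLT mentioned above, and the function names are invented for illustration. It takes the same two inputs the stylesheet relies on: the year, and the weekday of January 1 as a number from 1 to 7 (Sunday to Saturday). The 14 combinations come from 7 possible starting weekdays times 2 (leap year or not).

```python
MONTH_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def is_leap(year):
    """Gregorian leap-year rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def month_grid(year, jan1_weekday, month):
    """Return the month as rows of seven cells, Sunday first.

    jan1_weekday is 1-7 for Sunday-Saturday; month is 1-12.
    Blank cells before the 1st and after the last day are None.
    """
    lengths = MONTH_LENGTHS.copy()
    if is_leap(year):
        lengths[1] = 29
    days_before = sum(lengths[:month - 1])  # days elapsed before this month
    first_col = (jan1_weekday - 1 + days_before) % 7  # 0 = Sunday column
    cells = [None] * first_col + list(range(1, lengths[month - 1] + 1))
    cells += [None] * (-len(cells) % 7)  # pad the final week
    return [cells[i:i + 7] for i in range(0, len(cells), 7)]

# January 1, 1940 was a Monday, i.e. 2 in the Sunday-first numbering.
august_1940 = month_grid(1940, 2, 8)
```

With these inputs the first row of August 1940 has four blank cells followed by the 1st, 2nd and 3rd, so August 7 lands in the Wednesday column.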

fig. 3 Travel summary display for 1954

Travel Summary Display

In the calendar XML data, each day can contain a basic <travel> element that states where the Paulings were on that day. This information is gleaned mostly from travel itineraries and speaking engagements, but also from correspondence or manuscript notes, and usually represents a city. On the Year page, we wanted to present a nice summary for the year of where the Paulings traveled to, and how long their trips were. Because the data was spread across days, and not already grouped together, it was a challenge at first to get trip totals that were always accurate and grouped appropriately. Using the new grouping features in XSLT version 2, the various trips could be grouped together appropriately, and then displayed in order. A range could be computed using the first and last entry of the group, and linked to the day page for the first entry. Now if you see an interesting place that the Paulings visited, you can go directly to the first day they were there, and see what they were doing that day. If more than one day was spent in a given place, a total is displayed showing how many days they were there.
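The grouping step can be illustrated with a short Python sketch. It is an analogue of the approach, not the stylesheet code: itertools.groupby collects adjacent items that share a key, which is similar in spirit to adjacent grouping in XSLT 2. The dates and place names below are invented, and the sketch assumes one place per day.

```python
from itertools import groupby

# (ISO date, place) pairs as they might be read from <travel> elements.
# All values are made up for illustration.
days = [
    ("1954-03-01", "Place A"),
    ("1954-03-02", "Place A"),
    ("1954-03-03", "Place A"),
    ("1954-03-05", "Place B"),
    ("1954-03-06", "Place A"),
    ("1954-03-07", "Place A"),
]

def summarize_trips(days):
    """Group consecutive entries with the same place into trips."""
    trips = []
    # Sorting by date first means each run of identical places is one stay.
    for place, group in groupby(sorted(days), key=lambda d: d[1]):
        dates = [d[0] for d in group]
        trips.append({
            "place": place,
            "first": dates[0],   # the Day Page the summary would link to
            "last": dates[-1],
            "days": len(dates),  # shown only when greater than one
        })
    return trips

trips = summarize_trips(days)
```

A return visit to the same place is kept as a separate trip because another place sits between the two runs. One caveat of this simple version: two stays in the same place separated only by days with no travel data would be merged, so a fuller version would also compare the dates.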

METS for Documents and Images

The last post covered what METS records are and how we are using them. Because the files are fairly complex and have additional data that we aren’t using, the calendar stylesheet abstracts away the unneeded info and complexity. A temporary data structure is used to store the data needed, and then the calendar stylesheet refers to that in its templates, instead of dealing directly with the METS record and the descriptive MODS metadata. This approach is also used in the stylesheets for our Documentary History websites, and portions of the code were able to be repurposed for the calendar stylesheet.
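As a rough illustration of this kind of abstraction, the Python sketch below flattens a pared-down METS record into a small dictionary that later code can use without touching METS or MODS directly. The record is a simplified stand-in, and the choice of fields (id, title, date) is an assumption; the actual temporary structure used by the stylesheet is not shown in this post.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# A heavily simplified stand-in for a full METS record.
METS_RECORD = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:mods="http://www.loc.gov/mods/v3"
           OBJID="sci14.038.9-lp-weaver-19400807">
  <mets:dmdSec ID="dmd1"><mets:mdWrap MDTYPE="MODS"><mets:xmlData>
    <mods:mods>
      <mods:titleInfo>
        <mods:title>Letter from Linus Pauling to Warren Weaver</mods:title>
      </mods:titleInfo>
      <mods:originInfo><mods:dateCreated>1940-08-07</mods:dateCreated></mods:originInfo>
    </mods:mods>
  </mets:xmlData></mets:mdWrap></mets:dmdSec>
</mets:mets>
"""

def flatten(mets_xml):
    """Reduce a METS/MODS record to the handful of fields a page needs."""
    root = ET.fromstring(mets_xml)
    return {
        "id": root.get("OBJID"),
        "title": root.findtext(".//mods:title", namespaces=NS),
        "date": root.findtext(".//mods:dateCreated", namespaces=NS),
    }

doc = flatten(METS_RECORD)
```

Page templates can then refer to doc["title"] and friends, and any change in how the METS record is laid out is confined to the flatten function.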

Transcripts

As covered in the last post, the transcripts are stored in individual TEI Lite XML files. In our calendar year XML data, a simple <transcript> element added to a listing conveys the ID of the transcript file. The stylesheet can then take this ID, retrieve the file, apply the formatting templates, and output the result to the HTML pages. We use XML Namespaces to keep the types of source documents separate, and then the XSLT stylesheet can apply formatting to only a specific one. So, if we wanted to apply some styling to the title element, we could make sure that it was only the title element from a TEI file, not from a METS file. This allows us to have a group of formatting templates for TEI Lite files in a separate XSLT file, which can be imported by the calendar stylesheet, and none of its rules and templates will affect anything but the TEI Lite files. Since the code for the TEI Lite files and transcripts was already written for earlier projects, very little stylesheet code (less than 10 lines) was needed to add the ability to display nice, formatted transcripts.
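The effect of namespaces on the title example can be shown with a small Python sketch. It is not the XSLT import mechanism itself, only the underlying idea: two elements both named title are told apart by their namespace. The namespace URIs below are the published TEI and MODS ones, but whether the project files use exactly these is an assumption, as are the wrapper element and the sample titles.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
MODS = "http://www.loc.gov/mods/v3"

# Two <title> elements that differ only by namespace; <bundle> is hypothetical.
COMBINED = f"""
<bundle>
  <TEI xmlns="{TEI}"><title>Transcript title</title></TEI>
  <mods xmlns="{MODS}"><titleInfo><title>Catalog title</title></titleInfo></mods>
</bundle>
"""

def format_tei_titles(xml_text):
    """Apply formatting to TEI titles only; MODS titles are left alone."""
    root = ET.fromstring(xml_text)
    return [f"<h1>{t.text}</h1>" for t in root.iter(f"{{{TEI}}}title")]

formatted = format_tei_titles(COMBINED)
```

Only the TEI title is matched, which is the same guarantee that lets a separate set of TEI formatting rules be imported without disturbing anything else.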

The Building Blocks of Linus Pauling Day-by-Day

fig. 1 The technical workflow of Linus Pauling Day-by-Day

Any given page in the Linus Pauling Day-by-Day calendar is the product of up to four different XML records.  These records describe the various bits of data that comprise the project – be they document summaries, images or full-text transcripts.  The data contained in the various XML records are then interpreted by XSL stylesheets, which redistribute the information and generate local HTML files as their output.  The local HTML is, in turn, styled using CSS and then uploaded to the web.

Got all that?

In reality, the process is not quite as complicated as it may seem. Today’s post is devoted to describing the four XML components that serve as building blocks for the calendar project. Later on this week, we’ll talk more about the XSL side of things. (For some introductory information on XML and XSL, see this post, which discusses our use of these tools in creating The Pauling Catalogue.)

Preliminary work in WordPerfect

The 68,000+ document summaries that comprise the meat of Linus Pauling Day-by-Day have been compiled by dozens of student assistants over the past ten years.  Typically, each student has been assigned a small portion of the collection and is charged with summarizing, in two or three sentences, each document contained in their assigned area.  These summaries have, to date, been written in a series of WordPerfect documents.

The January 30, 2009 launch of Linus Pauling Day-by-Day is being referred to, internally, as Calendar 1.5.  This is in part because of several major workflow changes that we have on tap for future calendar work, a big part of it being a movement out of WordPerfect.  While the word processing approach has worked pretty well for our students – it’s an interface with which they’re familiar, and includes all the usual advantages of a word processing application (spellchecking, etc.) – it does present fairly substantial complications for later stages of the work flow.

For one, everything has to be exported out of the word processing documents and marked up in XML by hand. For another, the movement out of a word processor and into XML often carries with it issues related to special characters, especially “&,” “smart quotes” and “em dash,” all of which can play havoc with certain XML applications.
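One common way to tame these characters is to escape the ampersand and write everything outside plain ASCII as a numeric character reference. The Python sketch below shows that technique under the assumption that the exported text is already available as a Unicode string; it is an illustration, not the conversion process used for the calendar, and the sample sentence is invented.

```python
from xml.sax.saxutils import escape

def to_xml_text(text):
    """Escape markup characters, then turn non-ASCII characters
    (smart quotes, dashes and the like) into numeric references."""
    escaped = escape(text)  # handles &, < and >
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

# \u201c and \u201d are curly double quotes; \u2014 is an em dash.
sample = "Research & teaching \u2014 \u201cnotes\u201d"
converted = to_xml_text(sample)
```

The result contains only ASCII, so it survives tools that are confused about encodings, and any XML parser turns the references back into the original characters.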

Our plan for Calendar 2.0 is to move out of a word processing interface for the initial data entry in favor of an XForms interface, but that’s fodder for a later post.

The “Year XML”

Once a complete set of data has been compiled in WordPerfect, the content is then moved into XML.  All of the event summaries that our students write are contained in what might be called “Year XML” records.  An example of the types of data that are contained in these XML files is illustrated here in fig. 2.  Note that the information in the fig. 2 slide is truncated for display purposes – all of the “—-” markers represent text that has been removed for sake of scaling the illustration – but that generally speaking, the slide refers to the contents of the January 2, 1940 and August 7, 1940 Day-by-Day pages, the latter of which will also serve as our default illustrations reference.

Cursory inspection of the “Year XML” slide reveals one of the mark-up language’s key strengths – its simplicity. For the most part, all of the tags used are easily understandable to humans and the tag hierarchy that organizes the information follows a rather elementary logic. The type of record is identified using <calendar>, months are tagged as <month>, days are tagged as <day> and document summaries are tagged as <event>.

The one semi-befuddling aspect of the “Year XML” syntax is the i.d. system used in reference to illustrations and transcripts.  After much experimentation, we have developed an internal naming system that works pretty well in assigning unique identifiers to every item in the Pauling archive.  The system is primarily based upon a document’s Pauling Catalogue series location and folder identifier, although since the Catalogue is not an item-level inventory (not completely, anyway) many items require further description in their identifier.  In the most common case of letters, the further description includes identifying the correspondents and including a date.

Fig. 2 provides an example of three identifiers. The first is <record><id series="09">1940i.38</id></record>, which is the “Snapshot” reference for the 1940 index page. This identifier is relatively simple as it defines a photograph contained in the Pauling Catalogue Photographs and Images series (series 09), the entirety of which is described on the item level. So this XML identifier utilizes only a series notation (“09”) and a Pauling Catalogue notation (1940i.38).

The two other examples in Fig. 2 are both letters. The first is <record><id series="01">corr136.9-lp-giguere-19400102</id></record>, a letter from Linus Pauling to Paul Giguere located in the Pauling Catalogue Correspondence series (series 01) in folder 136.9, and used on Day-by-Day as the illustration for the first week of January, 1940. Because the folder is not further described on the item level, there exists a need for more explication in the identifier of this digital object. Hence the listing of the correspondents involved and the date on which the letter was written.

The second example is a similar case: <record><id series="11">sci14.038.9-lp-weaver-19400807</id></record>, used as the Day-by-Day illustration for the first full week of August 1940. In this instance, however, the original document is held in Pauling Catalogue series 11, Science, and is a letter written by Pauling to Warren Weaver on August 7, 1940.
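The structure of the two letter identifiers can be made explicit with a small parser. The Python sketch below is inferred from just the examples above, so the pattern is an assumption: it presumes a folder reference, then sender, then recipient, then an eight-digit date, all in lower case. Identifiers that follow other conventions simply do not match.

```python
import re
from datetime import date

# Pattern inferred from the two letter examples; not an official specification.
LETTER_ID = re.compile(
    r"^(?P<folder>[a-z]+[\d.]+)-(?P<sender>[a-z]+)-(?P<recipient>[a-z]+)-(?P<date>\d{8})$"
)

def parse_letter_id(identifier):
    """Split a letter identifier into folder, correspondents and date."""
    m = LETTER_ID.match(identifier)
    if m is None:
        return None  # e.g. item-level identifiers such as "1940i.38"
    d = m.group("date")
    return {
        "folder": m.group("folder"),
        "sender": m.group("sender"),
        "recipient": m.group("recipient"),
        "date": date(int(d[:4]), int(d[4:6]), int(d[6:])),
    }

weaver = parse_letter_id("sci14.038.9-lp-weaver-19400807")
```

Here weaver["folder"] is "sci14.038.9" and weaver["date"] is August 7, 1940, while the photograph identifier 1940i.38 returns None because it carries no correspondents or date.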

METS Records to Power the Illustrations

We’ve talked about METS records a few times in the past, and have defined them as “all-in-one containers for digital objects.”  The Pauling to Weaver illustration mentioned above is a good example of this crucial piece of functionality, in that it is used as a week illustration in the August 1940 component of the Day-by-Day project, and is also a supporting document on the “It’s in the Blood!” documentary history website.  Despite its dual use, the original document was only ever scanned once and described once in METS and MODS.  Once an item is properly encoded in a METS record, it becomes instantly available for repurposing throughout our web presence.

Just about everything that we need to know about a scanned document is contained in its METS record.  In the case of Day-by-Day, we can see how various components of the Pauling to Weaver METS record are extracted to display on two different pages of the project.  Fig. 3 is a screenshot of this page, the “Week Index View” for the August 7, 1940 Day-by-Day page (all of the days for this given week will display the same illustration, but will obviously feature different events and transcripts listings).  Fig. 4 is a screenshot of the “Full Illustration View,” wherein the user has clicked on the Week Index illustration and gained access to both pages of the letter, as well as a more-detailed description of its contents.

Below (fig. 5) is an annotated version of the full METS record for the Pauling to Weaver letter.  As you’ll note once you click on it, fig. 5 is huge, but it’s worth a look in that, among other details, it gives an indication of how different components of the record are distributed to different pages. For example:

  • The Object Title, “Letter from Linus Pauling to Warren Weaver,” which displays in both views.
  • The Object Summary, “Pauling requests that Max Perutz…,” which displays only in the Full Illustration View.
  • The Object Date, used in both views.
  • The Local Object Identifier, sci14.038.9-lp-weaver-19400804, which displays at the bottom of the Full Illustration View.
  • The Page Count, used only in the Week Index View.
  • Crucially, the 400 pixel-width jpeg images, which are stored in one location on our web server (corresponding, again, with Pauling Catalogue series location), but in this example retrieved for display only in the Week Index View.
  • And likewise, the 600 pixel-width jpeg images, which are retrieved for Day-by-Day display in the Full Illustration View, but also used for reference display in the Documentary History projects.

fig. 5 An annotated version of the full METS record for digital object sci14.038.9-lp-weaver-19400804

An additional word about the illustrations used in Linus Pauling Day-by-Day

One of the major new components of the “1.5 Calendar” launch is full-page support for illustrations of ten pages or less – in the 1.0 version of the project, only the first page of each illustration was displayed, no matter the length of the original document.  Obviously this is a huge upgrade in the amount and quality of the content that we are able to provide from within the calendar.  The question begs to be asked, however, “why ten pages or less?”

In truth, the ten pages rule is somewhat arbitrary, but it works pretty well in coping with a major scaling problem that we face with the Day-by-Day project.  Users will note that the “Full Illustration View” for all Day-by-Day objects presents the full page content (when available) on a single html page, as opposed to the cleaner interface used on our Documentary History sites.  There’s a good reason for this.  In the instance of the Documentary History interface, essentially two html pages are generated for every original page of a document used as an illustration: a reference view and a large view.  This approach works well for the Documentary History application, in that even very large objects, such as Pauling’s 199-page long Berkeley Lectures manuscript, can be placed on the web without the size of a project exploding out of control – the Berkeley Lectures comprise 398 html pages, which is a lot, but still doable.

Linus Pauling Day-by-Day, on the other hand, currently repeats the full set of images that make up an illustration on the page for every day of the week to which that illustration is assigned. In other words, if the Berkeley Lectures were chosen to illustrate a week within the calendar, and the full content of the digital object were to be displayed for each day of that week using the same clean interface as a Documentary History, a sum total of 2,786 (199 x 2 x 7) html pages would need to be generated to accomplish the mission. For that week only. Obviously this is not a sustainable proposition. By contrast, the current version 1.5 approach always requires 7 html pages for each week, though full image support and super-clean display are sometimes sacrificed in the process.
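The arithmetic behind that comparison is easy to check. The helper names below are made up for this Python sketch; the numbers are the ones quoted above.

```python
def documentary_history_pages(document_pages, views_per_page=2):
    """Two html pages (reference view and large view) per original page."""
    return document_pages * views_per_page

def calendar_pages_if_paginated(document_pages, days_in_week=7, views_per_page=2):
    """The same interface repeated for every day of the illustrated week."""
    return documentary_history_pages(document_pages, views_per_page) * days_in_week

berkeley_in_history = documentary_history_pages(199)
berkeley_in_calendar = calendar_pages_if_paginated(199)
```

For the 199-page manuscript these come to 398 and 2,786 pages respectively, against a constant 7 pages per week under the single-page approach.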

Calendar 2.0 will deal with the issue using a database approach, but again, this is a different topic for a different time.

Last but not least, TEI Lite

We’ve discussed TEI Lite in the past as well and will not spend a great deal of time with it here, except to reiterate that it is a simple mark-up language that works well in styling full-text transcripts and other similar documents for the web.

There are nearly 2,000 TEI Lite documents included in Linus Pauling Day-by-Day, virtually all of them transcripts of letters sent or received by Linus Pauling. Transcript references within the Year XML are illustrated in fig. 2 above – they follow the exact same naming convention as our METS records, except that the mets.xml suffix is replaced by tei.xml. It is worth noting that rough drafts of most of the text that was eventually encoded in TEI for the Day-by-Day project were generated using OCR software. And while OCRing has improved mightily over the years, it still does have its quirks, which is why some of you might find, for example, the lower-case letter l replaced by the numeral 1 in a few of the transcripts currently online. (We’re working on it.)

The TEI Lite mark-up for the Pauling to Weaver letter is illustrated in fig. 6, as is, in fig. 7, the encoding for the biographical chronology (written by Dr. Robert Paradowski) used on the 1940 index page.  Note, in particular, the use of <div> tags to declare a year’s worth of information in the Paradowski mark-up.  These tags were included as markers for the xsl stylesheet to pull out a given chunk of data to be placed on a given year’s index page.  The entire Paradowski chronology will be going online soon, and once again that project, as with Day-by-Day, will be generated from only this single record.

fig. 6 The TEI Lite mark-up for Pauling's August 7, 1940 letter to Warren Weaver.

fig. 7 The TEI Lite mark-up for one year of the Paradowski chronology
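The role of the <div> markers in the chronology can be sketched in a few lines of Python. This is an illustration of the idea only: the type and n attributes used to label each year, and the placeholder entries, are assumptions, since the actual attributes appear only in the figure.

```python
import xml.etree.ElementTree as ET

# Placeholder chronology; the attribute names marking each year are hypothetical.
CHRONOLOGY = """
<text><body>
  <div type="year" n="1939"><p>Entry for 1939.</p></div>
  <div type="year" n="1940"><p>Entry for 1940.</p><p>Second entry.</p></div>
</body></text>
"""

def year_chunk(tei_xml, year):
    """Return the paragraphs inside the <div> that wraps the given year."""
    root = ET.fromstring(tei_xml)
    div = root.find(f".//div[@n='{year}']")
    return [] if div is None else [p.text for p in div.findall("p")]

entries_1940 = year_chunk(CHRONOLOGY, 1940)
```

Because each year is wrapped in its own <div>, one chronology file can feed every year index page, and a year with no entry simply yields an empty list.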

Custom XML, METS records and TEI Lite – these are the building blocks of Linus Pauling Day-by-Day.  Check back later this week when we’ll discuss the means by which the blocks are assembled into a finished website using XSL stylesheets.

“It’s in the Blood!” A Revised, METS-based Website

Pastel drawing of a hemoglobin molecule by Roger Hayward, 1964.

“It [hemoglobin] is a good substance from the standpoint of a chemist, because of its availability. All you need to do is to catch somebody, introduce a hypodermic needle and draw out a sample of blood. A standard victim of this practice, weighing perhaps 120 pounds (it’s easier to catch them small!) contains in the red corpuscles in his blood one and two-tenths pounds of hemoglobin.”
– Linus Pauling, 1966.

Some reasonably big news to share today. As announced here, we have launched a revised and expanded version of our 2005 release “It’s in the Blood!  A Documentary History of Linus Pauling, Hemoglobin and Sickle Cell Anemia.”

Similar to the revised version of our “Nature of the Chemical Bond” documentary history website, which was launched this past February, the “second edition” of “It’s in the Blood!” contains a ton more content: the final tally runs to 53 new letters, 458 pages of added manuscripts and papers, 18 new pictures and 11 new audio and video clips.  The metadata for all of the site’s content is drastically improved as well — a fact that is most immediately evident on the various Key Participants pages, which have been transformed from rather spartan affairs to content-rich resources like this page devoted to Harvey Itano.

Aside from the self-evident benefits of adding more content to our pages, revising the older documentary histories has also pushed our digitization work further in the direction of a uniform METS-based platform. We’ll talk a lot more about METS records at a later time, but for now it’s sufficient to define them as all-in-one containers for digital objects.

We use METS (Metadata Encoding and Transmission Standard) and MODS (Metadata Object Description Schema), both of which are flavors of XML, not only to describe a scanned item in a qualitative sense, but also to define how the item displays on a page.

For example, the METS record that “powers” the hemoglobin molecule above includes an internal i.d., the date and creator of the record, the image caption and its copyright data, the creator of the image (Roger Hayward) and any individuals or organizations associated with it (Linus Pauling, in this case, since the drawing was published in Pauling and Hayward’s The Architecture of Molecules). The record also stores the date of the item’s creation (1964), and the genre type of the original document. (We use the Library of Congress’ Basic Genre Terms for Cultural Heritage Materials as our genre authority. The Hayward item is defined by BGTCHM as an “illustration.”)

The METS record also defines certain display characteristics that are then interpreted by the XSL stylesheets that build our HTML pages.  Again using our hemoglobin molecule as an example, the METS record which defines the object’s output declares that it can be displayed at one of four different sizes.  The 150-pixel width display is used for all images inserted as Narrative sidebar images (Hemoglobin is on this page), as well as all images aggregated onto a given All Documents and Media index page. (Hemoglobin is about 3/4 of the way down the Pictures and Illustrations index.)  A 400-pixel width version will be used in a revised version of our “Linus Pauling Day-by-Day” project, which we hope to launch later this year.  The 600-pixel width “reference images” display like this, and the 900-pixel width big kahunas look like this.
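To make the idea of one record serving several display sizes concrete, here is a Python sketch that picks an image out of a METS file section according to where it will be shown. The four widths come from the paragraph above; everything else is assumed for illustration, including the use of the USE attribute to carry the width, the file names, and the context labels.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical file section: one file group per display width.
FILE_SEC = """
<mets:fileSec xmlns:mets="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileGrp USE="150"><mets:file ID="f150">
    <mets:FLocat LOCTYPE="URL" xlink:href="hemoglobin-150w.jpg"/></mets:file></mets:fileGrp>
  <mets:fileGrp USE="400"><mets:file ID="f400">
    <mets:FLocat LOCTYPE="URL" xlink:href="hemoglobin-400w.jpg"/></mets:file></mets:fileGrp>
  <mets:fileGrp USE="600"><mets:file ID="f600">
    <mets:FLocat LOCTYPE="URL" xlink:href="hemoglobin-600w.jpg"/></mets:file></mets:fileGrp>
  <mets:fileGrp USE="900"><mets:file ID="f900">
    <mets:FLocat LOCTYPE="URL" xlink:href="hemoglobin-900w.jpg"/></mets:file></mets:fileGrp>
</mets:fileSec>
"""

# Context labels are invented; the widths are those named in the post.
WIDTH_FOR_CONTEXT = {"sidebar": 150, "calendar": 400, "reference": 600, "large": 900}

def image_for(file_sec_xml, context):
    """Return the image location recorded for the given display context."""
    root = ET.fromstring(file_sec_xml)
    width = WIDTH_FOR_CONTEXT[context]
    flocat = root.find(f"mets:fileGrp[@USE='{width}']/mets:file/mets:FLocat", NS)
    return None if flocat is None else flocat.get(XLINK_HREF)

sidebar_image = image_for(FILE_SEC, "sidebar")
```

The page-building code only ever asks for a context, so adding a new project that wants, say, the 400-pixel version requires no change to the record itself.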

METS records take a while to create, but the payoff is well worth the effort. The flexibility that METS provides both within and across projects is of huge importance to us. When building really big websites and/or multiple websites with subject matter that tends to overlap (the documentary histories, the Day-by-Day calendar and the Pauling Student Learning Curriculum, for example), it is far more efficient to be able to describe an object once but use it again and again.

Right now, “Linus Pauling and the Race for DNA” is the last of our documentary history sites still requiring METS attentions.  Once it’s revised, we’ll be able to start thinking concretely about providing different types of portal views and search tools for our growing METS cache (well over 3,000 records currently), an eventuality that promises a whole new range of possibilities for our entire digitization workflow.  But that’s a different topic for a different day…