Paradowski’s Pauling Chronology

Ava Helen, Linus Jr. and Linus Pauling, 1930.

According to Zelek S. Herman ‘the best biography in my opinion is the short one by Pauling’s only authorized biographer, Robert J. Paradowski who has extensively studied Pauling’s scientific work and who knew him for many years.’ Unfortunately this biography/chronology, though written in English, was published in a hard-to-find Japanese volume. Paradowski’s 1972 University of Wisconsin Ph.D. thesis titled The Structural Chemistry of Linus Pauling is the only compulsively readable thesis that has come my way in a lifetime of sitting on M.S. and Ph.D. final orals. Subsequently Paradowski got to know Pauling intimately, was anointed as his authorized biographer, and has spent the last 20 years accumulating a unique store of knowledge concerning his subject. Since Paradowski writes well and has unusually catholic interests, the book – or rather books, since it will be in at least three volumes – should be well worth waiting for. But for how long? Paradowski tells me he definitely expects completion by the centennial year, 2001…. I hope I live to read it!

Derek Davenport, “The Many Lives of Linus Pauling: A Review of Reviews,” Journal of Chemical Education, 73 (9): A210.  September 1996.

Every year, in celebration of Linus Pauling’s birthday anniversary, we try to release a new project either on or near February 28th.  Nine years ago, we participated in the mounting of a small plaque in Oregon State University’s Education Hall Room 201, which identified the location where Linus and Ava Helen Pauling first met in 1922.  The next year marked the Pauling Centenary and a large day-long conference was held in honor of the occasion.  Since then, most of our birthday releases have come in the form of websites – Linus Pauling Research Notebooks in 2002, The Race for DNA in 2003, Awards, Honors and Medals in 2004, and an expanded version of Linus Pauling and the Nature of the Chemical Bond in 2008.

The 2009 release is Robert Paradowski’s The Pauling Chronology, a 27-page TEI-based web resource, the content of which Derek Davenport referenced in the 1996 quote above.  The Chronology is being advertised as “the most detailed overview of Linus Pauling’s ancestry, life and work available on the web,” a statement that we feel can be made without hesitation.  As Davenport notes, Paradowski enjoyed the most unfettered access to Pauling of any of his at least four biographers, and has assembled what is surely the most extensive collection of biographical interviews conducted with Pauling and his associates.

Consequently, it is chiefly Paradowski upon whom we must rely to fill in certain scarcely documented eras of Pauling’s life, particularly his early years as a boy in Condon and Portland, and as a young man at Oregon Agricultural College.

As a web resource, the Chronology likewise addresses certain major topics that have yet to be properly explored by other members of the Pauling online collective (ourselves, of course, included).  Pauling’s historic program of research on the structure of proteins, for instance, while touched upon in The Race for DNA, has not yet received the attention that it deserves, at least in terms of an Internet presence.  The Chronology helps to remedy this situation.  Even more so, Pauling’s controversial infatuation with vitamin C, as well as the unsteady early history of the institute which bears his name, receive a fair and thorough treatment in Paradowski’s write-up.

Paradowski’s knowledge of his subject and skill as a writer shine through in his Chronology, traits which lend ever-increasing urgency to Davenport’s crucial question, “But for how long?”  The centennial year has clearly come and gone with no major biography published and no hints that it might soon be on the way.  No hints, that is, except, perhaps, this passage that Dr. Paradowski used to close the talk that he gave at the 2001 Centenary conference:

Toward the end of [Pauling’s] life, someone had sent him a Bible with gigantic print.  He was having trouble with his vision, but because this book had such gigantic print he was reading the Bible again….  And perhaps he ran into this passage in Ecclesiastes, and I’d like to quote it as my finish:  “Let us search then like those who must find, and find like those who continue to search, for it is written the man who has reached the end is only beginning.”  And that’s the way I feel about my work on Pauling – no matter how much I do, it seems like I’m just beginning.

The Pauling Chronology and the transcribed video of a 1995 talk by Dr. Robert Paradowski are both available via the Linus Pauling Online portal.

The Building Blocks of Linus Pauling Day-by-Day, Part 2: XSLT

In our last post we discussed the four XML pieces that form the content of Linus Pauling Day-by-Day. Once properly formatted, these disparate elements are then combined into a cohesive, web-ready whole using a set of rules called Extensible Stylesheet Language Transformations (XSLT). This post will delve more deeply into XSLT and how we use it on the Linus Pauling Day-by-Day site.

Creating the Various HTML Pages

A single XSLT file controls the majority of the actions and templates that make up Linus Pauling Day-by-Day. Several Windows batch files, one for each year in the Calendar, invoke the calendar stylesheet on that year's calendar XML. After some initial processing, the stylesheet begins to output all of the various pages. First, the main Calendar Homepage is created. This can be redundant when processing multiple years in a row, but ensures that the statistics and links are always up-to-date. Second, the Year Page is created. This pulls in extra data from the Pauling Chronology TEI file, specific to the year, displays a summary of the Paulings’ travel for that year, provides a snapshot image for the year, and lastly presents the activities for the year that don’t have a more specific date. An additional page is created that provides a larger image and some metadata for the snapshot document.

Then, acting on the months present in the data file, a Month Page is created. This presents the first calendar grid (how this is done is explained in the next section) followed by all of the days in that month that have activities and document data. If there is data about where the Paulings traveled, that is displayed in the calendar grid to give a visual overview of their itineraries. Finally, a Day Page is created for each day present within a month. The stylesheet simply acts on the days listed in the calendar XML file, and does not try to figure out how many days are in a given month in a given year. This style of acting on the data provided, rather than always looping over a fixed size or range, is characteristic of the declarative, functional style of programming that XSLT encourages.  The Day Page features a large thumbnail of a document or photo, a smaller calendar grid, with travel information for the day displayed below if present, and then the activities and documents for that day.

fig. 1 Day Page view

Another page is then created with larger images of the document, additional pages if present, and some metadata about the document. As most of the output pages have a similar look and feel, there is a set of templates that handle the calendar navigation menu. Depending on what context the navigation is needed for (Year Page, Month Page, etc.), the output can adjust accordingly.

Computing the Calendar Grids

An important part of the look and feel of the project from early in the planning stage was to have familiar and user-friendly navigation for the large site, which meant that the traditional calendar grid would play a big role.

fig. 2 Calendar Grid example, including travel

However, the data that we’ve amassed at the day level doesn’t have any information like the day of the week.  Beyond the 7 days of the week that January 1 could fall on, the grids for each month are complicated by whether or not a year is a leap year, resulting in 14 combinations that don’t occur in a regular pattern. If you’ve ever looked at a perpetual calendar, you’ll know that the grids are deceptive in how simple they look.

An initial design goal was to have the calendar stylesheet be able to handle all of the grid variations with only minimal help. We didn’t really want to have to prepare and store lots of information for each month of each year that the project would encompass. For each year, all the stylesheet needs is the day of the week that January 1 falls on (a number 1-7, representing Sunday-Saturday in our case), which is stored in the year XML, and then it can take that and the year and figure out all of the month grids. Fitting the algorithm for this into the XSLT stylesheet code is one of the more complex coding projects we’ve worked on.
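To make the idea concrete, here is a minimal sketch, written in Python rather than the XSLT the site actually uses, of how every month grid for a year can be derived from just the year and the weekday of January 1. The function names and the use of 0 for blank cells are assumptions made for illustration; only the 1-7, Sunday-Saturday numbering comes from the description above.

```python
def is_leap(year):
    """Gregorian leap-year rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def month_grids(year, jan1_weekday):
    """Build a calendar grid for every month of `year`.

    jan1_weekday: 1 = Sunday ... 7 = Saturday (the numbering described above).
    Returns {month: [week, ...]} where each week is a list of 7 cells and
    0 marks a blank cell (an illustrative convention, not the site's).
    """
    days_in_month = [31, 29 if is_leap(year) else 28, 31, 30, 31, 30,
                     31, 31, 30, 31, 30, 31]
    grids = {}
    offset = jan1_weekday - 1          # blank cells before the 1st of January
    for month, ndays in enumerate(days_in_month, start=1):
        cells = [0] * offset + list(range(1, ndays + 1))
        cells += [0] * (-len(cells) % 7)   # pad the final week to 7 cells
        grids[month] = [cells[i:i + 7] for i in range(0, len(cells), 7)]
        offset = (offset + ndays) % 7      # weekday of the next month's 1st
    return grids


# January 1, 1940 fell on a Monday, i.e. 2 in the Sunday = 1 numbering.
grids_1940 = month_grids(1940, 2)
```

The only state carried from month to month is the running weekday offset, which is why a single stored number per year is enough.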

It took several hundred lines of code, but we haven’t needed to mess with it since it was first written, even as we’ve added years and expanded the project. With the help of new features in XSLT version 2 and several more years of experience, the code could be rewritten to be cleaner and more efficient. However, because it was so reliable, time on stylesheet development was spent elsewhere.

fig. 3 Travel summary display for 1954

Travel Summary Display

In the calendar XML data, each day can contain a basic <travel> element that states where the Paulings were on that day. This information is gleaned mostly from travel itineraries and speaking engagements, but also from correspondence or manuscript notes, and usually represents a city. On the Year page, we wanted to present a nice summary for the year of where the Paulings traveled to, and how long their trips were. Because the data was spread across days, and not already grouped together, it was a challenge at first to get trip totals that were always accurate and grouped appropriately. Using the new grouping features in XSLT version 2, the various trips could be grouped together appropriately, and then displayed in order. A range could be computed using the first and last entry of the group, and linked to the day page for the first entry. Now if you see an interesting place that the Paulings visited, you can go directly to the first day they were there, and see what they were doing that day. If more than one day was spent in a given place, a total is displayed showing how many days they were there.
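The grouping step can be sketched roughly as follows. This is a Python analogue, not the site's stylesheet: it assumes that consecutive day entries naming the same place form one trip, which is similar in spirit to adjacent grouping in XSLT 2.0, though the source does not say which grouping mode was used. The place names and dates in the sample are invented.

```python
from datetime import date
from itertools import groupby


def summarize_trips(days):
    """Collapse per-day travel entries into trips.

    days: list of (date, place) tuples, already sorted by date.
    Returns a list of (place, first_day, last_day, day_count) tuples,
    one per run of consecutive entries that share the same place.
    """
    trips = []
    for place, run in groupby(days, key=lambda entry: entry[1]):
        run = list(run)
        trips.append((place, run[0][0], run[-1][0], len(run)))
    return trips


# Hypothetical sample data, for illustration only.
sample_days = [
    (date(1954, 1, 3), "Chicago"),
    (date(1954, 1, 4), "Chicago"),
    (date(1954, 1, 5), "New York"),
    (date(1954, 1, 9), "Chicago"),
]
trips = summarize_trips(sample_days)
```

The first date of each trip is what a summary would link to, and the count is only worth displaying when it is greater than one.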

METS for Documents and Images

The last post covered what METS records are and how we are using them. Because the files are fairly complex and have additional data that we aren’t using, the calendar stylesheet abstracts away the unneeded info and complexity. A temporary data structure is used to store the data needed, and then the calendar stylesheet refers to that in its templates, instead of dealing directly with the METS record and the descriptive MODS metadata. This approach is also used in the stylesheets for our Documentary History websites, and portions of the code were able to be repurposed for the calendar stylesheet.
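As a rough Python illustration of that abstraction step (the site itself does this in XSLT), the sketch below reads a heavily trimmed METS record and keeps only a title and a list of image locations in a small data structure. The `DocInfo` class, the sample record and its file names are hypothetical; the namespace URIs are the standard METS, MODS and XLink ones.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
    "xlink": "http://www.w3.org/1999/xlink",
}


@dataclass
class DocInfo:
    """Hypothetical temporary structure holding only what templates need."""
    title: str
    image_urls: list


# A trimmed, made-up METS record; real records carry far more detail.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:mods="http://www.loc.gov/mods/v3"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:dmdSec ID="dmd1"><mets:mdWrap MDTYPE="MODS"><mets:xmlData>
    <mods:mods><mods:titleInfo>
      <mods:title>Letter from Linus Pauling to Warren Weaver</mods:title>
    </mods:titleInfo></mods:mods>
  </mets:xmlData></mets:mdWrap></mets:dmdSec>
  <mets:fileSec><mets:fileGrp USE="reference">
    <mets:file ID="f1"><mets:FLocat LOCTYPE="URL" xlink:href="page1.jpg"/></mets:file>
    <mets:file ID="f2"><mets:FLocat LOCTYPE="URL" xlink:href="page2.jpg"/></mets:file>
  </mets:fileGrp></mets:fileSec>
</mets:mets>"""


def summarize_mets(mets_text):
    """Extract the title and image locations, ignoring everything else."""
    root = ET.fromstring(mets_text)
    title = root.findtext(".//mods:title", default="", namespaces=NS)
    href = "{%s}href" % NS["xlink"]   # ElementTree's qualified attribute name
    urls = [loc.get(href) for loc in root.findall(".//mets:FLocat", NS)]
    return DocInfo(title=title, image_urls=urls)


info = summarize_mets(SAMPLE_METS)
```

Templates that work from a structure like `DocInfo` never need to know how deeply the title is nested inside the METS and MODS wrappers.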

Transcripts

As covered in the last post, the transcripts are stored in individual TEI Lite XML files. In our calendar year XML data, a simple <transcript> element added to a listing conveys the ID of the transcript file. The stylesheet can then take this ID, retrieve the file, apply the formatting templates, and output the result to the HTML pages. We use XML Namespaces to keep the types of source documents separate, and then the XSLT stylesheet can apply formatting to only a specific one. So, if we wanted to apply some styling to the title element, we could make sure that it was only the title element from a TEI file, not from a METS file. This allows us to have a group of formatting templates for TEI Lite files in a separate XSLT file, which can be imported by the calendar stylesheet, and none of its rules and templates will affect anything but the TEI Lite files. Since the code for the TEI Lite files and transcripts was already written for earlier projects, very little stylesheet code (less than 10 lines) was needed to add the ability to display nice, formatted transcripts.
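The namespace idea can be shown with a small Python sketch; the real site does this with XSLT template matching. It assumes the TEI files use the TEI P5 namespace and that the competing title lives in the MODS metadata of a METS record; the wrapper element and sample titles are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs; which ones the site's files actually declare is an assumption.
NS = {
    "tei": "http://www.tei-c.org/ns/1.0",
    "mods": "http://www.loc.gov/mods/v3",
}

# Hypothetical combined document holding a title from each vocabulary.
SAMPLE = """<wrapper xmlns:tei="http://www.tei-c.org/ns/1.0"
                    xmlns:mods="http://www.loc.gov/mods/v3">
  <tei:title>Transcript title</tei:title>
  <mods:title>Catalogue title</mods:title>
</wrapper>"""


def tei_titles(xml_text):
    """Return only the title elements that belong to the TEI namespace."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.findall(".//tei:title", NS)]


titles = tei_titles(SAMPLE)
```

Both elements are named `title`, but the namespace prefix in the query means the MODS one is never touched.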

The Building Blocks of Linus Pauling Day-by-Day

fig. 1 The technical workflow of Linus Pauling Day-by-Day

Any given page in the Linus Pauling Day-by-Day calendar is the product of up to four different XML records.  These records describe the various bits of data that comprise the project – be they document summaries, images or full-text transcripts.  The data contained in the various XML records are then interpreted by XSL stylesheets, which redistribute the information and generate local HTML files as their output.  The local HTML is, in turn, styled using CSS and then uploaded to the web.

Got all that?

In reality, the process is not quite as complicated as it may seem.  Today’s post is devoted to describing the four XML components that serve as building blocks for the calendar project.  Later on this week, we’ll talk more about the XSL side of things. (For some introductory information on XML and XSL, see this post, which discusses our use of these tools in creating The Pauling Catalogue)

Preliminary work in WordPerfect

The 68,000+ document summaries that comprise the meat of Linus Pauling Day-by-Day have been compiled by dozens of student assistants over the past ten years.  Typically, each student has been assigned a small portion of the collection and is charged with summarizing, in two or three sentences, each document contained in their assigned area.  These summaries have, to date, been written in a series of WordPerfect documents.

The January 30, 2009 launch of Linus Pauling Day-by-Day is being referred to, internally, as Calendar 1.5.  This is in part because of several major workflow changes that we have on tap for future calendar work, a big part of it being a movement out of WordPerfect.  While the word processing approach has worked pretty well for our students – it’s an interface with which they’re familiar, and includes all the usual advantages of a word processing application (spellchecking, etc.) – it does present fairly substantial complications for later stages of the work flow.

For one, everything has to be exported out of the word processing documents and marked up in XML by hand.  For another, the movement out of a word processor and into XML often carries with it issues related to special characters, especially “&,” “smart quotes” and “em dash,” all of which can play havoc with certain XML applications.
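One way to tame those characters before mark-up is sketched below in Python. This is not the workflow described here, just an illustration: it swaps typographic quotes and em dashes for plain ASCII stand-ins (a design choice; numeric character references would work too) and then escapes the ampersand and angle brackets with the standard library.

```python
from xml.sax.saxutils import escape

# Typographic characters mapped to plain stand-ins (an illustrative choice).
PLAIN_EQUIVALENTS = {
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u2014": "--",  # em dash
}


def clean_for_xml(text):
    """Normalize word-processor punctuation, then escape &, < and >."""
    for fancy, plain in PLAIN_EQUIVALENTS.items():
        text = text.replace(fancy, plain)
    return escape(text)


cleaned = clean_for_xml("Pauling & Corey \u201cnotes\u201d \u2014 draft")
```

Doing the replacement first and the escaping second keeps the ampersand from being escaped twice.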

Our plan for Calendar 2.0 is to move out of a word processing interface for the initial data entry in favor of an XForms interface, but that’s fodder for a later post.

The “Year XML”

Once a complete set of data has been compiled in WordPerfect, the content is then moved into XML.  All of the event summaries that our students write are contained in what might be called “Year XML” records.  An example of the types of data that are contained in these XML files is illustrated here in fig. 2.  Note that the information in the fig. 2 slide is truncated for display purposes – all of the “—-” markers represent text that has been removed for sake of scaling the illustration – but that generally speaking, the slide refers to the contents of the January 2, 1940 and August 7, 1940 Day-by-Day pages, the latter of which will also serve as our default illustrations reference.

Cursory inspection of the “Year XML” slide reveals one of the mark-up language’s key strengths – its simplicity.  For the most part, all of the tags used are easily understood by humans and the tag hierarchy that organizes the information follows a rather elementary logic.  The type of record is identified using <calendar>, months are tagged as <month>, days are tagged as <day> and document summaries are tagged as <event>.
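A toy version of that hierarchy, and a few lines of Python to walk it, might look like the sketch below. Only the four element names come from the description above; the `year` and `n` attributes and the event text are invented for the example, and the real records certainly carry more detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified "Year XML"; attribute names are assumptions.
SAMPLE_YEAR_XML = """
<calendar year="1940">
  <month n="1">
    <day n="2">
      <event>Placeholder summary of a letter.</event>
    </day>
  </month>
  <month n="8">
    <day n="7">
      <event>Placeholder summary A.</event>
      <event>Placeholder summary B.</event>
    </day>
  </month>
</calendar>
"""


def events_by_day(xml_text):
    """Map (month, day) to the list of event summaries found in the data.

    Only days that actually appear in the XML are visited; nothing here
    needs to know how many days a month has.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for month in root.findall("month"):
        for day in month.findall("day"):
            key = (int(month.get("n")), int(day.get("n")))
            result[key] = [event.text for event in day.findall("event")]
    return result


events = events_by_day(SAMPLE_YEAR_XML)
```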

The one semi-befuddling aspect of the “Year XML” syntax is the i.d. system used in reference to illustrations and transcripts.  After much experimentation, we have developed an internal naming system that works pretty well in assigning unique identifiers to every item in the Pauling archive.  The system is primarily based upon a document’s Pauling Catalogue series location and folder identifier, although since the Catalogue is not an item-level inventory (not completely, anyway) many items require further description in their identifier.  In the most common case of letters, the further description includes identifying the correspondents and including a date.

Fig. 2 provides an example of three identifiers.  The first is <record><id series=”09″>1940i.38</id></record>, which is the “Snapshot” reference for the 1940 index page.  This identifier is relatively simple as it defines a photograph contained in the Pauling Catalogue Photographs and Images series (series 09), the entirety of which is described on the item level.  So this XML identifier utilizes only a series notation (“09”) and a Pauling Catalogue notation (1940i.38).

The two other examples in Fig. 2 are both letters.  The first is <record><id series=”01″>corr136.9-lp-giguere-19400102</id></record>, a letter from Linus Pauling to Paul Giguere located in the Pauling Catalogue Correspondence series (series 01) in folder 136.9, and used on Day-by-Day as the illustration for the first week of January, 1940.  Because the folder is not further described on the item level, there exists a need for more explication in the identifier of this digital object.  Hence the listing of the correspondents involved and the date on which the letter was written.

The second example is a similar case: <record><id series=”11″>sci14.038.9-lp-weaver-19400807</id></record>, used as the Day-by-Day illustration for the first full week of August 1940.  In this instance, however, the original document is held in Pauling Catalogue series 11, Science, and is a letter written by Pauling to Warren Weaver on August 7, 1940.
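For letters, the identifier pattern can be pulled apart mechanically. The Python sketch below infers the pattern from nothing more than the two letter examples above (location, sender, recipient, eight-digit date), so the regular expression and field names are assumptions and would likely need loosening for the full archive.

```python
import re
from datetime import datetime

# Pattern inferred from the two letter identifiers quoted above; hypothetical.
LETTER_ID = re.compile(
    r"^(?P<location>[a-z]+[0-9.]+)"   # e.g. corr136.9 or sci14.038.9
    r"-(?P<sender>[a-z]+)"            # e.g. lp
    r"-(?P<recipient>[a-z]+)"         # e.g. giguere
    r"-(?P<date>[0-9]{8})$"           # e.g. 19400102 (YYYYMMDD)
)


def parse_letter_id(identifier):
    """Split a letter identifier into its parts, or return None if it
    does not follow the letter pattern (e.g. a photograph identifier)."""
    match = LETTER_ID.match(identifier)
    if match is None:
        return None
    parts = match.groupdict()
    parts["date"] = datetime.strptime(parts["date"], "%Y%m%d").date()
    return parts


giguere = parse_letter_id("corr136.9-lp-giguere-19400102")
weaver = parse_letter_id("sci14.038.9-lp-weaver-19400807")
photo = parse_letter_id("1940i.38")
```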

METS Records to Power the Illustrations

We’ve talked about METS records a few times in the past, and have defined them as “all-in-one containers for digital objects.”  The Pauling to Weaver illustration mentioned above is a good example of this crucial piece of functionality, in that it is used as a week illustration in the August 1940 component of the Day-by-Day project, and is also a supporting document on the “It’s in the Blood!” documentary history website.  Despite its dual use, the original document was only ever scanned once and described once in METS and MODS.  Once an item is properly encoded in a METS record, it becomes instantly available for repurposing throughout our web presence.

Just about everything that we need to know about a scanned document is contained in its METS record.  In the case of Day-by-Day, we can see how various components of the Pauling to Weaver METS record are extracted to display on two different pages of the project.  Fig. 3 is a screenshot of this page, the “Week Index View” for the August 7, 1940 Day-by-Day page (all of the days for this given week will display the same illustration, but will obviously feature different events and transcripts listings).  Fig. 4 is a screenshot of the “Full Illustration View,” wherein the user has clicked on the Week Index illustration and gained access to both pages of the letter, as well as a more-detailed description of its contents.

Below (fig. 5) is an annotated version of the full METS record for the Pauling to Weaver letter.  As you’ll note once you click on it, fig. 5 is huge, but it’s worth a look in that, among other details, it gives an indication of how different components of the record are distributed to different pages. For example:

  • The Object Title, “Letter from Linus Pauling to Warren Weaver,” which displays in both views.
  • The Object Summary, “Pauling requests that Max Perutz…,” which displays only in the Full Illustration View.
  • The Object Date, used in both views.
  • The Local Object Identifier, sci14.038.9-lp-weaver-19400804, which displays at the bottom of the Full Illustration View.
  • The Page Count, used only in the Week Index View.
  • Crucially, the 400 pixel-width jpeg images, which are stored in one location on our web server (corresponding, again, with Pauling Catalogue series location), but in this example retrieved for display only in the Week Index View.
  • And likewise, the 600 pixel-width jpeg images, which are retrieved for Day-by-Day display in the Full Illustration View, but also used for reference display in the Documentary History projects.

fig. 5 An annotated version of the full METS record for digital object sci14.038.9-lp-weaver-19400804

An additional word about the illustrations used in Linus Pauling Day-by-Day

One of the major new components of the “1.5 Calendar” launch is full-page support for illustrations of ten pages or less – in the 1.0 version of the project, only the first page of each illustration was displayed, no matter the length of the original document.  Obviously this is a huge upgrade in the amount and quality of the content that we are able to provide from within the calendar.  The question begs to be asked, however, “why ten pages or less?”

In truth, the ten pages rule is somewhat arbitrary, but it works pretty well in coping with a major scaling problem that we face with the Day-by-Day project.  Users will note that the “Full Illustration View” for all Day-by-Day objects presents the full page content (when available) on a single html page, as opposed to the cleaner interface used on our Documentary History sites.  There’s a good reason for this.  In the instance of the Documentary History interface, essentially two html pages are generated for every original page of a document used as an illustration: a reference view and a large view.  This approach works well for the Documentary History application, in that even very large objects, such as Pauling’s 199-page long Berkeley Lectures manuscript, can be placed on the web without the size of a project exploding out of control – the Berkeley Lectures comprise 398 html pages, which is a lot, but still doable.

Linus Pauling Day-by-Day, on the other hand, currently requires that the full complement of images theoretically comprising an illustration be used, specifically, for each unique day of the week for which an image is chosen.  In other words, if the Berkeley Lectures were chosen to illustrate a week within the calendar, and the full content of the digital object were to be displayed for each day of that week using the same clean interface as a Documentary History, a sum total of 2,786 (199 x 2 x 7) html pages would need to be generated to accomplish the mission.  For that week only.  Obviously this is not a sustainable proposition. By contrast, the current version 1.5 approach always requires 7 html pages for each week, though full image support and super-clean display are sometimes sacrificed in the process.
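The arithmetic behind those figures is simple enough to write down. The Python function below is only a restatement of the multiplication in the text; its name and parameter names are made up for the example.

```python
def html_pages_needed(document_pages, views_per_page=2, days_per_week=7):
    """HTML pages generated if every page of a document gets its own
    reference view and large view, repeated for each day of the week."""
    return document_pages * views_per_page * days_per_week


# The 199-page Berkeley Lectures manuscript:
once = html_pages_needed(199, days_per_week=1)   # Documentary History style
weekly = html_pages_needed(199)                  # repeated for all seven days
```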

Calendar 2.0 will deal with the issue using a database approach, but again, this is a different topic for a different time.

Last but not least, TEI Lite

We’ve discussed TEI Lite in the past as well and will not spend a great deal of time with it here, except to reiterate that it is a simple mark-up language that works well in styling full-text transcripts and other similar documents for the web.

There are nearly 2,000 TEI Lite documents included in Linus Pauling Day-by-Day, virtually all of them transcripts of letters sent or received by Linus Pauling.  Transcript references within the Year XML are illustrated in fig. 2 above – they follow the exact same naming convention as our METS records, except that the mets.xml suffix is replaced by tei.xml.  It is worth noting that rough drafts of most of the text that was eventually encoded in TEI for the Day-by-Day project were generated using OCR software.  And while OCR has improved mightily over the years, it still has its quirks, which is why some of you might find, for example, the lower-case letter l replaced by the number 1 in a few of the transcripts currently online (we’re working on it).

The TEI Lite mark-up for the Pauling to Weaver letter is illustrated in fig. 6, as is, in fig. 7, the encoding for the biographical chronology (written by Dr. Robert Paradowski) used on the 1940 index page.  Note, in particular, the use of <div> tags to declare a year’s worth of information in the Paradowski mark-up.  These tags were included as markers for the xsl stylesheet to pull out a given chunk of data to be placed on a given year’s index page.  The entire Paradowski chronology will be going online soon, and once again that project, as with Day-by-Day, will be generated from only this single record.
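The "pull out one year's chunk" step can be sketched in Python as below; the site does this in XSL. How each year's <div> is actually labeled is not stated here, so the `n` attribute, the surrounding structure and the sample text are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton of a chronology file with one <div> per year.
SAMPLE_CHRONOLOGY = """<text><body>
  <div n="1939"><p>Placeholder entry for 1939.</p></div>
  <div n="1940"><p>Placeholder entry for 1940.</p></div>
</body></text>"""


def year_chunk(tei_text, year):
    """Return the <div> element marked with the given year, or None."""
    root = ET.fromstring(tei_text)
    for div in root.iter("div"):
        if div.get("n") == str(year):
            return div
    return None


chunk_1940 = year_chunk(SAMPLE_CHRONOLOGY, 1940)
```

Because every year sits in its own marked container, a single chronology file can feed both a standalone site and each year's index page.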

fig. 6 The TEI Lite mark-up for Pauling's August 7, 1940 letter to Warren Weaver.

fig. 7 The TEI Lite mark-up for one year of the Paradowski chronology

Custom XML, METS records and TEI Lite – these are the building blocks of Linus Pauling Day-by-Day.  Check back later this week when we’ll discuss the means by which the blocks are assembled into a finished website using XSL stylesheets.

2008: The Year in Pauling

Linus Pauling at his Deer Flat Ranch home, near Big Sur, California. 1987.

Notable Projects and Events

This has been a terrifically-productive year for the Oregon State University Libraries Special Collections:

Behind the Numbers

The various websites that we have launched over the years continue to attract a fairly large volume of traffic.   Over the past twelve months, our web domain has been the focus of 11.93 million pageviews. (A pageview being officially defined as “A request to the web server by a visitor’s browser for any web page; this excludes images, javascript, and other generally embedded file types.”)  This total is a marked decrease from the 2007 measurement of 14.7 million pageviews.  However, our new releases this year were more of a niche variety, whereas 2007 marked the launch of “Linus Pauling and the International Peace Movement: A Documentary History,” as well as two additional new years of Day-by-Day content.  The difference between these types of projects helps explain the downturn.

The largest source of 2008 traffic (4.48 million pageviews) is an oldie but a goody – Linus Pauling Research Notebooks.   Originally released in 2002 and consisting of well over 15,000 html files, this cross-indexed digital version of Pauling’s 46 research notebooks has, by our count, generated roughly 39.5 million pageviews over the course of its existence.  The research notebooks site is also the only one of our many Pauling-centric projects to bubble up into the top 10 of Google’s results for the simple Linus Pauling keyword search (not that we’re complaining, of course…).

Second in popularity is, per usual, the mammoth Linus Pauling Day-by-Day project (3.71 million), which currently provides a daily accounting of Linus and Ava Helen Pauling’s activities for the years 1930-1954.

Our four Documentary History websites jockey back and forth for third through sixth places.  Having received a big update in February, the Bond site is a clear favorite right now, though Blood will probably move up as well, having also been recently revised.  Here’s a look at how the numbers are shaking out for the major projects under the specialcollections/coll/pauling domain.

stats

Check back on Friday for a few thoughts on search and a peek at 2009.

A New TEI Lite Project: The Pauling Student Learning Curriculum

This past Friday we launched a new project about which we’re pretty excited.  As described in this press release, the Pauling Student Learning Curriculum is geared toward advanced high school- and college-age students, and is applicable to the teaching of both history and science.  As the press release also notes, the large amount of illustrative and hyperlinked content included in the website makes this a resource that should be useful to teachers and students anywhere in the world.

The history of this project is a long and interesting one.  The curriculum itself was originally designed nearly ten years ago for use by visiting fellows of the Linus Pauling Institute.  Over time, the content that was developed for the fellows program was repurposed for use by a University Honors College chemistry class that conducts research on the Pauling legacy every Winter term.  For several years we’ve been planning to post the text of the curriculum online, thinking that doing so would assist those chemistry students whose busy schedules preclude their spending an optimum amount of time in the Special Collections reading room.  It eventually dawned on us that the curriculum could actually be expanded into a powerful resource for use by teachers well beyond the Oregon State University campus, and we’ve been developing the project with that goal in mind ever since.

The bulk of the curriculum is devoted to an abbreviated survey of Pauling’s life and work, presented in chronological order, and grouped under the following headings:

Throughout these sections, we’ve linked to any applicable objects that have already been digitized in support of our various Documentary History and Primary Source websites.

The curriculum also includes a series of instructions on “rules for research” in an archive.  We feel that this is especially important given the youth of our target audience, and hope that it will likewise provide a positive introduction to the ins and outs of conducting scholarship with primary sources, an oftentimes intimidating process for researchers at any level.

The website itself is built with TEI Lite, which we’re using more and more in support of small but clean webpages that can be created and released comparatively quickly.   Though we’ve used the TEI (Text Encoding Initiative) Lite standard for numerous transcripts projects in the past, the first of our sites to be built entirely in TEI Lite was the biographical essay “Bernard Malamud: An Instinctive Friendship,” written by Chester Garrison and posted on our Bernard Malamud Papers page last month.  Plans for several additional TEI Lite-based “microsites” are currently in the works.

TEI Lite is a terrific tool in part because it is very simple to use.  In the example of the curriculum, all of the text, images, administrative metadata and much of the formatting that appears on the finished site are encoded in easily-learned and interpreted tags.  (We used XSL to generate the table of contents and to standardize the page formatting — e.g., where the images sit on a page and how the captions render.)  As a result, most of the mark-up required for these projects is at least roughed out by our student staff, which makes for a pretty efficient workflow within the department.

An example of the TEI Lite code for Page 2 of the Pauling Student Learning Curriculum is included after the jump.  We’ll be happy to answer any reader questions in the Comments to this post.