Tuesday, October 9, 2012

HathiTrust UnCamp

On September 10-11, I attended an UnCamp at the HathiTrust Research Center (HTRC) in Bloomington, Indiana. It turned out to be more of a traditional conference than an unconference, but it did give me the chance to learn more about the HTRC. I was hoping for more discussion of how libraries are actually using the HathiTrust (HT), but the HTRC looks promising for digital humanities research.

The HTRC's goal is to provide computational access to a large portion of the works in the HathiTrust. Being able to run textual analysis tools on a corpus as large as HathiTrust would enable humanists to do some interesting research. The mix of languages and subjects in the Trust closely resembles that of the physical collections of the contributing libraries. For now, access is mostly limited to pre-1923 American publications that are known to be in the public domain. They are starting to allow researchers to submit requests for analysis of items that are thought to be in copyright. The rationale behind this is known as "non-consumptive research": the researcher would not be reading or "consuming" the text but would instead be analyzing only the words it contains. The proposed text mining could be done only by researchers at institutions that have a signed agreement with Google, since they would primarily be mining works scanned by the Google Books project. All of this is very much up in the air at this point.

The public domain materials in HT comprise about 2,500,000 print volumes. There are about 2.3 TB of raw OCR text and 3.7 TB of managed OCR text; the latter has had some processing and error correction. The entire HT collection includes 10.5 million volumes, or around 1,000,000,000 pages! HTRC is building analysis tools that let researchers perform sophisticated operations. For instance, one topic modeling tool looks at how words cluster within a document and across the entire collection. It keeps iterating until it can assign words to topics based on the probability that they will occur with similar words. They have also created tools that check the quality of the OCR and create tag clouds from the data.
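To give a rough sense of that iterative word-to-topic assignment, here is a toy, illustrative sketch of one common approach (LDA-style collapsed Gibbs sampling) — this is not the HTRC's actual tool, just the general idea in miniature:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA-style topic modeling.

    docs: list of documents, each a list of word tokens.
    Returns, for each topic, its three most frequent words.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)

    # Count tables: per-word topic assignments, doc-topic and topic-word counts.
    assignments = []
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start with random topic assignments.
    for d, doc in enumerate(docs):
        z = []
        for w in doc:
            t = rng.randrange(num_topics)
            z.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z)

    # Keep iterating: resample each word's topic from its conditional
    # probability, so words drift toward topics they co-occur with.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight = (topic's share of this doc) * (word's share of topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * V)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    return [sorted(tw, key=tw.get, reverse=True)[:3] for tw in topic_word]
```

Run on a handful of toy "documents," the sampler tends to pull co-occurring words into the same topic; at HathiTrust scale the same idea requires serious distributed computing.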

What might a philosopher do with millions of words? One interesting research project that used the HTRC infrastructure compared the Stanford Encyclopedia of Philosophy with the Internet Encyclopedia of Philosophy. Analyzing the terms "anthropomorphism" and "parsimony" in the two works yielded wildly different results. Each term was examined to see what words clustered around it, in order to provide context for the main term. The convergence and divergence of certain terms gave a good view of each encyclopedia's focus. Much of the math behind this was beyond me, but the presenter made a good case for its accuracy. Scaling this type of analysis to larger collections will pose challenges; the HTRC will have to expand its computing power greatly to enable large text-mining projects.
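The basic move — finding which words cluster around a target term — can be sketched in a few lines. This is a simple window-based co-occurrence count, a stand-in for the project's (much more sophisticated) method:

```python
from collections import Counter

def context_words(text, target, window=5, top=5):
    """Count the words that appear within `window` tokens of each
    occurrence of `target` — a rough proxy for the term's context."""
    tokens = text.lower().split()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts.most_common(top)
```

Comparing the output of something like this for "parsimony" across two encyclopedias would surface the convergences and divergences the presenter described, though the real analysis presumably used proper tokenization, stopword removal, and statistical weighting.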