Monday, March 21, 2011

Society for Textual Scholarship

March 16-18 I attended the Society for Textual Scholarship (STS) conference at Penn State. STS is an association of scholarly editors, bibliographers, book historians, librarians, digital humanists, and regular old literature scholars who are interested in textual criticism.

On Wednesday I taught a seminar to faculty on "Making Primary Sources Primary." We looked at the different skills students (especially undergraduates) need to build when they work with primary sources, and at assignments to develop those skills, with an emphasis on how to work with special collections.

Other highlights included:

* A really useful panel on data forensics, wherein I learned some of the latest thinking on taking in born-digital materials.

* A wonderful panel on platform studies, which included a presentation by Kari Kraus of the University of Maryland (in the School of Information and Department of English) on the Preserving Virtual Worlds project led by the Library of Congress.

* A great panel on "Editing Digital Feminisms," which was followed by another great panel on Emily Dickinson. To attend, however, I had to miss the "Poetics of Metadata" panel; sad.

* Very cool plenary presentations by Will Noel of the Walters on the Archimedes Palimpsest; David Stork of Stanford and Ricoh Innovations on "digital connoisseurship of master paintings"; and by Lisa Gitelman of NYU on typescript editions of the 1930s, which focused on a strange creation of the Social Science Research Council called "Methods of Reproducing Scholarly Materials"--which turns out to have been published by the man who went on to found UMI, which of course became ProQuest. Does this mean we need to start thinking more carefully about preserving outdated research tools because of what they can tell us about the history of disciplines and scholarly information?

STS had a dedicated DH track this year, curated by Matt Kirschenbaum of the University of Maryland, this year's program chair. It was awesome, and the conference as a whole was excellent.

Thursday, March 17, 2011

SKOS-2-HIVE workshop

I attended the SKOS-2-HIVE workshop held at George Washington University on March 9th, 2011.
HIVE (Helping Interdisciplinary Vocabulary Engineering) is an IMLS-funded project. It is an automatic metadata generation approach that dynamically integrates discipline-specific controlled vocabularies encoded with the Simple Knowledge Organization System (SKOS).
The morning session introduced the basic concepts of thesauri, taxonomies, ontologies, and SKOS. We also experimented with the demo version of the HIVE system, which integrates vocabularies including LCSH, MeSH, NBII, and AGROVOC. The HIVE Concept Browser allows users to browse and search concepts in interdisciplinary vocabularies. It is a nice tool for finding broader, narrower, or related terms across multiple vocabularies. HIVE Indexing automatically extracts concepts from a given document to aid cataloging and indexing.
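
To make the cross-vocabulary browsing idea concrete, here is a toy in-memory sketch (not the actual HIVE software or its API; the concept identifiers and labels are invented for illustration). A SKOS vocabulary boils down to concepts linked by skos:broader, skos:narrower, and skos:related relations, which a browser can traverse:

```python
# Toy model of SKOS-style concepts; identifiers below are hypothetical.
concepts = {
    "lcsh:Soils": {"broader": [], "narrower": ["lcsh:SoilChemistry"], "related": ["agrovoc:soil"]},
    "lcsh:SoilChemistry": {"broader": ["lcsh:Soils"], "narrower": [], "related": []},
    "agrovoc:soil": {"broader": [], "narrower": [], "related": ["lcsh:Soils"]},
}

def narrower_terms(concept_id, depth=1):
    """Collect narrower terms down to the given depth (breadth-first)."""
    results = []
    frontier = [concept_id]
    for _ in range(depth):
        next_frontier = []
        for cid in frontier:
            for n in concepts.get(cid, {}).get("narrower", []):
                results.append(n)
                next_frontier.append(n)
        frontier = next_frontier
    return results

print(narrower_terms("lcsh:Soils"))  # ['lcsh:SoilChemistry']
```

In a real SKOS store the same traversal would run as a SPARQL query over RDF triples, but the relation structure is the same.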

The afternoon session covered the technical backend of the HIVE system, which consists of a triple store (RDF data), Lucene, KEA (for indexing), and APIs. The current HIVE system is a proof-of-concept demo; the code is open source, and what comes next is still up in the air. It is a simple demonstration of linked data, and the HIVE APIs could be useful in building metadata management tools.
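
As a rough illustration of the indexing step: the real KEA is a trained machine-learning keyphrase extractor, but at its simplest, vocabulary-based indexing means scanning a document for controlled-vocabulary labels and ranking the matches. This sketch (vocabulary entries invented here) shows only that naive matching step, not KEA itself:

```python
import re
from collections import Counter

# Hypothetical preferred labels mapped to concept identifiers.
vocabulary = {
    "soil chemistry": "lcsh:SoilChemistry",
    "climate change": "lcsh:ClimateChange",
    "erosion": "lcsh:Erosion",
}

def extract_concepts(text, top_n=5):
    """Count occurrences of each vocabulary label in the text."""
    text = text.lower()
    counts = Counter()
    for label, concept_id in vocabulary.items():
        hits = len(re.findall(re.escape(label), text))
        if hits:
            counts[concept_id] = hits
    return counts.most_common(top_n)

doc = "Soil chemistry shifts under climate change; climate change also drives erosion."
print(extract_concepts(doc))
# [('lcsh:ClimateChange', 2), ('lcsh:SoilChemistry', 1), ('lcsh:Erosion', 1)]
```

The output suggests candidate subject headings for a cataloger to review, which is the kind of aid to indexing practice the workshop described.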

Tuesday, March 15, 2011

Webwise 2011, posted by Isabelle Kargon

I was at Webwise 2011, March 9-11: Science, Technology, Engineering and Math (STEM) in Education, Learning and Research.

It was held at the Renaissance Harborplace in Baltimore. I attended the main conference sessions on Thursday and Friday.

The first keynote presentation was given by Joshua Greenberg, Director of the A. P. Sloan Foundation’s Digital Information program. He talked about the main topic of this information age: data, and how to store them. He mentioned several projects of interest, among them MoBeDAC, a data management site for the microbiome of the built environment. He also mentioned Google’s Ngram Viewer, which charts the frequency of two- or three-word sequences (n-grams) across a full-text corpus spanning centuries.
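
The core idea behind an n-gram viewer is simple to sketch: count how often each n-word sequence occurs in a corpus (Google does this per publication year over millions of books; this toy version uses one invented sentence and word-level bigrams):

```python
from collections import Counter

def ngrams(text, n=2):
    """Count each n-word sequence in the text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

corpus = "the library of congress preserves the library catalog"
print(ngrams(corpus)[("the", "library")])  # 2
```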

The main problem, though, is that while creating data is easier and easier, many institutions have insufficient funding for proper data storage and curation, one big difficulty being computation. Solutions reside in the facilitation of data sharing and of interoperability. There is a need to train the workforce in information technology and to create broader collaboration between institutions, with more roles given to data-oriented professions. Libraries and museums need to network their data between institutions, that is, to expose data in common formats. In the 19th century, the items themselves were the data; nowadays the items are artifacts, and digitized books are the data. For small institutions, data curation is important to help communities figure out what to do with their data.

More examples of common data services and data sharing websites were given: the Encyclopedia of Life; the Zooniverse project, which enlists the participation of non-professional citizens in science projects; and Wikipedia.

There were many project presentations, among which I especially noticed Connecting Content by the California Academy of Sciences, a collaboration to link field notes to specimens and published literature. The Macaulay Library at Cornell University presented the NPR/NGS Radio Expeditions Sound Collection, and the National Building Museum in Washington, D.C. presented its Information Technology and Online Services Initiative, among others.

The first panel session was on Perfecting STEM Partnerships: Libraries, Museums and Formal Education. The stress was on developing 21st century skills in collaboration with formal education partners.

Marcia Mardis (Florida State U.) spoke of the project DL2SL (Digital Library to School Library), whose goal is tools for the rapid creation and export of MARC records for Web objects. Metadata creation tools would increase the “findability” of digital content.

Susie Allard’s (U. of Tennessee) talk was Environmental Science Librarians, Oh! My! She talked of new paradigms involving Web 2.0 and user-generated content, and stressed the importance of collaboration and data sharing. She gave, among others, the example of DataONE, a collaborative project that presents a “sustainable cyber-infrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.” M. Mardis talked of the necessity of developing shared curricula for research centers, strengthening institutional and research ties, and revitalizing traditional skills while developing new ones toward a new frontier of information creation, organization, and dissemination. Institutions should develop new knowledge sets and consider dynamic content for diverse user skills, such as what is found with blogs, wiki spaces, Flickr, etc.

Kwasi Asare, from the U.S. Dept. of Education, talked of learning powered by technology. He mentioned the need for interaction in learning and the importance of digital content. The goals to consider in education are learning in an engaging way, assessment for continuous improvement, teaching, infrastructure, and productivity. Currently the investment in education produces a bad ROI (return on investment), so there is a need for a more efficient use of time, money, and staff. The challenges facing educators are in managing the transition from print to digital material, and in access to resources for teachers.

This talk elicited questions about the engagement of school libraries in collaboration, and about the way school assessment is conducted through tests, leading to teaching oriented mostly toward test results.

The second session, moderated by Erika Shugart (National Academy of Science) was on STEM and the Participatory Web: Everyone is Invited! It presented projects generating meaningful audience participation.

Bridget Butler (conservation education specialist, Echo Lake Aquarium at Lake Champlain, and environmental reporter for NBC affiliate News Channel 5) presented Voices for the Lake. The project encourages online and on site engagement from the public, with the goal of contributing stories and connecting with the community around Lake Champlain.

Seth Cooper (U. of Washington) talked about the scientific video game Foldit, which allows people to play while contributing to the prediction of three-dimensional protein structures—the way protein molecules fold in space. Users participate in chats and forums, and they have created a wiki. When a user-generated model is used in a published paper, the users who helped find the protein configuration are credited as co-authors of the paper. The next step will be users helping to design entirely new proteins.

Jeff Grabill (Michigan State U.) and Kirsten Ellenbogen (Science Museum of Minnesota) talked about Science Buzz ( ): a community of people caring about science and society.

More project presentations included the Walters Art Museum’s Integrating the Art: China, an interactive resource integrating non-art disciplines (social studies, mathematics, science, …) with works of art. There was a fascinating project from the University of California, Berkeley, presented by Carl Haber, on Advancing Optical Scanning of Mechanical Sound Carriers: Connecting to Collections and Collaboration. The principle is to acquire digital maps of the surface of the media (old mechanical recordings) without contact, and then apply image analysis methods to recover the audio data and reduce noise. Sound from old records, wax cylinders, etc. can be recovered. This way the earliest sound recorded in history was restored: a phonautograph recording made by Édouard-Léon Scott in 1857. A partner for this project is the Library of Congress.
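
A heavily simplified sketch of the final step in that pipeline: once image analysis has produced a series of groove-displacement measurements (one per image column), they can be treated as a waveform, centered, and normalized. The real systems do far more (2D/3D surface metrology, groove tracking, noise reduction), and the input numbers here are invented:

```python
def trace_to_audio(displacements):
    """Convert measured groove displacements to a normalized waveform."""
    mean = sum(displacements) / len(displacements)      # remove DC offset
    centered = [d - mean for d in displacements]
    peak = max(abs(c) for c in centered) or 1.0          # avoid divide-by-zero
    return [c / peak for c in centered]

# Hypothetical displacement measurements from image analysis:
samples = trace_to_audio([10.0, 12.0, 10.0, 8.0, 10.0])
print(samples)  # [0.0, 1.0, 0.0, -1.0, 0.0]
```

Writing such samples out at a suitable sample rate yields playable audio, which is how sound is recovered without ever touching the fragile carrier.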

The third session, moderated by Ken Wiggin, was Tapping into Science: Promoting Collections for Use in Teaching, about the creation of digital collections by libraries and museums to provide greater access to their primary and secondary sources.

Kaye Howe (Director of the National Science Digital Library Resource Center) talked (without any notes or PowerPoint slides) about the educational needs of today’s youth. For today’s young people, learning is linked to technology and to networking. They need textbooks about experimental learning, but these books are scarce. The digital world allows remote people to access networks and resources. There is a need to understand who libraries and museums are making these resources for. Digital data need the same work of contextualization as traditional data. Here are a few quotes given by Kaye Howe: “Know who you are talking to” (Aristotle). “We thought it was a problem, and it was a mystery” (Gabriel Marcel). "Paintin' is a lot harder than pickin' cotton. Cotton's right there for you to pull off the stalk, but to paint, you got to sweat your mind" (Clementine Hunter, artist).

Kenning Arlitsch (Willard Marriott Library, U. of Utah) presented the Western Soundscape Archive, the largest free online archive of natural sounds in the Western U.S.

Rebecca Morin (California Academy of Sciences) talked about the Biodiversity Heritage Library, a consortium of 12 natural history and botanical libraries in the U.S. and 2 in the U.K. They try to give access to their collections as well as to primary sources through the Field Book Project.

Francine Berman (Rensselaer Polytechnic Institute) gave the second day keynote presentation: Got Data? The Role of Digital Information in 21st Century Research.

Data-driven research leads to research by professional experts as well as by the community. Data-driven discovery is illustrated by the model of the Milky Way obtained through spectroscopic surveys, or the earthquake model for the San Andreas fault obtained through supercomputation. The PDB (Protein Data Bank) is a worldwide repository built by processing and distributing data on the 3D structure of proteins. To support the life cycle of data is to ensure their capture from various sources; to edit them (i.e., organize, annotate, etc.); to use and reuse them for modeling, visualization, etc.; to publish them; and to preserve or destroy them. However, data corruption can occur, and data storage is growing along with its cost. Libraries and museums are good at collaboration and curation and should work with the research world. Data sharing can give a good competitive advantage and help to create a national research data infrastructure.

The fourth session was moderated by Tom Scheinfeldt (George Mason U.). It presented examples of crossover projects that brought disciplines together.

Chris Wildrick (Syracuse U.) talked about Dinosaur Aesthetics. He is a conceptual artist using dinosaur imagery and statistics for educational projects.

Fred Gibbs (George Mason U.) presented History and Data. Interested in digging into data on poisoning and criminal intent, he talked about the Proceedings of the Old Bailey, the records of felony trials held at the Old Bailey. He also mentioned The Victorian Frame of Mind, 1830-1870 by Walter E. Houghton, and projects such as plant mapping at JSTOR, which allows experts to edit inaccurate maps. In conclusion he mentioned the need for flexible ways to get data, the fact that tools impose their own limits, and the need to consider the life cycle of data. Is scientific literacy data literacy? Data are the middle ground between the sciences and the humanities.

Michael Benson (Kineticon Pictures) talked about the history of data mining, from Assyrian sky charts and compactly stored Babylonian astronomical data to the NASA archives mined for robotic search, curation, and exhibition making.

The fifth session was moderated by Greg Colati (U. of Denver). It was about Reuse and Recycle: Tools and Services for Managing, Preserving and Presenting Data for Sharing.

Sayeed Choudhury (from our own JHU Digital Research and Curation Center) talked about the Data Conservancy. He mentioned the problems of access to data and the time-consuming character of curating data on a large scale. One framework for organizing data and its conservancy is the Open Archival Information System (OAIS) reference model. He gave some examples of pilot projects illustrating prototyping as a strategy: a proof of concept based on ice road development; arXiv, an open-access repository connecting data to publications; IVOA (the International Virtual Observatory Alliance); Sakai, an open-source collaboration platform; NSIDC (the National Snow and Ice Data Center); the Dry Valley Visualization Project, which uses a Google Earth interface; and the Coastal Bay Visualization Project.

Leah Melber (Lincoln Park Zoo) talked about Ethograms for Everyone. The Ethosearch project is a database of ethograms, or animal behaviors, sorted by vocalization, localization, feeding and foraging, resting, etc. The users are researchers, biologists, professors, students, animal care professionals, and K-12 children.

Aaron Presnall (Jefferson Institute) talked about visualization of data through the Vidi Project. The concept is empowerment through the visualization of data: data need a narrative to be visualized, and bringing information brings empowerment. Examples of projects were the Lewis and Clark journey and military data from the National Archives of Serbia.

In all, this was a very informative conference. Libraries and museums do have a role to play in education, and many creative projects are making use of the possibilities offered by fast-evolving digital technologies that every institution has to be aware of.

Global Mission Driven Performance Institute

I attended and presented at the 3rd Annual Global Mission Driven Performance Institute, March 8-10, 2011 in Washington DC. The other presenters included Drs. Robert Kaplan and David Norton, the founders of the balanced scorecard approach to strategic planning; Tom Harrington, Associate Deputy Director of the FBI; Ellen Liston, Deputy City Manager, City of Coral Springs, Florida; Monica Niemi, Development Manager of Folkhälsan of Finland; and more.

Several things really hit home for me.
  1. Using the balanced scorecard has deeply changed the way many organizations approach both their strategic planning AND the implementation of that plan.
  2. Those organizations that have regular measurement meetings and whose leadership regularly talk about strategy are more effective in achieving their mission.
  3. Leadership is key - if you leave your strategy management to middle management it is a long development process and becomes too operationally focused.
  4. Drive execution of the strategy plan through communication.
  5. Communication, "7 times, 7 ways": this was a mantra repeated over and over in the presentations.
  6. Motto from the FBI - We do, we learn, we do better, we learn more. (I love this!!)
  7. There is tremendous power in clear choices and focus.
  8. Great leadership spends 30% of their time EACH DAY on strategy.
  9. Organizational change and transformation require consistent change and communication over the course of 5-10 years.
  10. Measurement is extremely important in non-profits because it creates accountability that is often lacking.
  11. Recognize that outcome measures for social change are hard, not impossible, but hard.
  12. Each year the FBI picks 10 key initiatives and those initiatives are fully funded; their strategy runs the budget - the budget does not run the strategy.
Dr. Kaplan delivered a talk on the six step strategy execution closed loop system.
  1. Develop strategy based on your mission, vision, and values - should be 10-15 years out for non-profits, with measurable objectives. This should not be about what you do but what you want to accomplish. Do some AS-IS modeling: we're here; what do we have to do to become XYZ?
  2. Plan your strategy map - executive committee has to create clarity, consensus, and commitment. Then create alignment, educate and communicate, develop feedback loops, and create real accountability.
  3. Align Organization - it is not important if you have 8 great units if you don't have 8 great units that work together well. Leadership aligns and synchronizes the organization.
  4. Plan your operations - map your key processes in detail to drive real results.
  5. Improve operationally - monitor and learn. Review your operational dashboards, ask questions (don't blame), and separate operations meetings from strategy meetings. Measures should only be questioned once per year: agree on them, then use them; once you have the measures, discuss the data they produce. Make it safe for people to report problems.
  6. Test and adapt - keep asking: how will we know when we are successful?
It is extremely important to understand the diagnostic piece of the puzzle before getting to the design-and-deliver piece. This is where I think we often fail as an organization - we are so hot to develop a solution we don't take the time to understand the problem we are trying to solve.

This conference continues to be an inspiring conference for understanding the power of the balanced scorecard to create meaningful strategic results.

Tuesday, March 8, 2011

Code4Lib 2011

In February, I attended the 2011 Code4Lib Conference.

"Code4Lib" isn't an organization or business or initiative. It's just a name a bunch of people use informally for a community. It has recently occurred to me that it could be described by the academic notion of "community of practice": "It is through the process of sharing information and experiences with the group that the members learn from each other, and have an opportunity to develop themselves personally and professionally."

The Code4Lib community is composed of library software developers and other technology workers. The participants are characterized by a strong desire to innovate and make things easier for our users -- but also by a focus on practical solutions that can be implemented quickly today, in our actually existing messy environment. This contrasts with other library organizations or communities more focused on long-term or "blue sky" research. This priority put on solutions with immediate 'bang for the buck' springs, in part, from most participants' positions in their library organizations: not in resource-rich grant-funded departments, but in departments working with legacy systems, without explicit R&D budgets, yet still with a passion to make our often challenging systems work better for users.

This context makes the 'community of practice' all the more important, because such developers are often more or less working solo on projects at their own institutions, without local technical peers. But in order to do what we do well, it's absolutely vital to have peers in a 'community of practice' to exchange ideas with, get advice and tips from, or mentor and be mentored by.

The annual conference, of which this year's was the 6th, seems to me to actually be one of the most powerful ways this community forms itself as an actual community, with shared knowledge, experiences, and social networks. In-person meeting can develop professional collaborative and information-sharing relationships and trust quicker and better than many megabytes of online interaction. This always ends up being the most useful aspect of the conference for me, more than the concrete information in any particular presentation.

On reflection, some aspects of how the conference is run seem nearly ideal for community-building, through some combination of intention and accident. The conference is single-tracked, with all participants seeing the same presentations: This builds the shared knowledge upon which a community of practice is based; and putting us all on the same page gives us lots of things to talk more about in late night hanging out in the conference hospitality suite, developing our shared understanding yet further. Presentations are only 20 minutes long, and supplemented by 5 minute "lightning talks" which can be given by anyone who wants to sign up during the conference itself. One goal is to maximize the number of audience members who are also up in front presenting: a kind of egalitarianism that hopefully makes it easier for newcomers to integrate themselves into the community.

One trend I noticed at this year's conference was an emphasis on building software packages by assembling pre-existing mature components, and more generally on composing software from individual, separable components.

This is, I think, a reaction to many previous efforts to build homegrown monolithic solutions, which ended up revealing our lack of capacity to successfully build such things in a sustainable way. If we can re-use mature open source components from non-library communities, with their own communities of developers, we can take advantage of that work and focus our efforts on our own unique needs on top of it. Likewise, if we can break up our own software projects into individual, somewhat independent components, one of those components can meet the needs of a wider community in a way that a more narrowly focused monolithic project can't. And by meeting the needs of a wider community, there are more potential collaborators to help on that component.

One example from a presentation at the conference is UPenn's effort to build software to compile, manage, and analyze usage data from library software. A few years ago they were engaged in an effort they called the "data farm". We at Hopkins were at one point considering trying to use their software for our own needs, but ended up not having the time or resources to fully investigate. However, in his presentation Tom Barker from UPenn revealed that they themselves had come to realize their first effort was not producing efficiently sustainable software: it was taking too much development time to adapt to new data sources, and was ending up too customized for other institutions to easily adopt or collaborate on. They switched gears, and have developed new software they currently call MetriDoc, which is based on several pre-existing mature open source components for transforming and storing data.

There are of course downsides to trying to build from several pre-existing powerful components, just different downsides from trying to build your own thing mostly from scratch (or much smaller pre-existing components). More components means more things the developer has to learn how to use, and more places to debug when something isn't going right. A component written for a more general purpose might not do exactly what you need. If you were counting on an open source component being maintained by an external community of developers, and it gets abandoned, it can be disastrous for your project.

But all approaches have risks, and by being aware of the risks you can choose your components (and your approaches in general) to minimize them. The more I write software, the more I realize that building complicated software such that it is flexible, sustainable, and expandable is actually difficult, taking skill, experience, time, and a bit of luck. It's not in fact -- despite the self-confidence of youth -- something that is obvious if you are just smart enough and know a couple of guidelines.

Hopefully the Code4Lib community can continue to help library developers build their personal skill and experience at developing quality software in a sustainable and efficient way, in order to increase, in aggregate, the library community's capacity for sustainable technological innovation. I know my participation in the community -- on listservs, at the conferences, on IRC, and through the Code4Lib Journal -- has increased my own capacity immeasurably.