I gave a presentation about the Data Conservancy (the name of our DataNet award) that focused on the proposal elements and initial technical architecture. The most important point I made was that there are significant unknowns and research questions related to data curation infrastructure, particularly storage systems. It is tempting to assume that storage is storage and that one can simply deposit large amounts of complex scientific data onto storage systems to complete the curation process, but nothing could be further from the truth. Storage is a necessary but far from sufficient condition.
Even at the storage layer, there are fundamental questions regarding failure rates, large-scale system performance, interfaces with preservation and access systems, etc. There are several vendors within the storage hardware and software sector, so the ability to span different vendors is an important consideration as well. More recently, cloud-based storage (and associated services) has become more widely available as an option.
One of the key questions for the Data Conservancy will be comparative advantage. That is, how do we leverage the expertise and capabilities of partners, including commercial storage providers, while focusing on our core areas of expertise such as preservation policies, processes and actions? This type of question could be explored through PASIG.
Both educational and commercial institutions have presented their technology offerings for preservation and archiving through PASIG, which has been helpful in developing a better understanding of the landscape. However, it is my hope that PASIG (and other digital preservation meetings) will start to address the "ecological" view of infrastructure development. No single institution will develop the capacity to curate all scientific data, so it's critical for our community to consider the distributed nature of curation, including storage systems. As we build out our local node of infrastructure at Johns Hopkins through the Data Conservancy, we will undoubtedly need to find ways to integrate and connect with other nodes of infrastructure.