As the volume of digital information continues to grow, we are faced with a paradox. We can read and interpret the Dead Sea Scrolls, written almost 2000 years ago, but we cannot do the same with data written 20 years ago to a 5.25-inch floppy disk. Ironically, as the world becomes digital, we may be entering a digital "Dark Ages" in which business, public, and personal assets are in ever greater danger of being lost. At the same time, the need for long-lived digital information is increasing. Recent compliance legislation that requires long-term data viability, such as HIPAA and the Sarbanes-Oxley Act, has increased the need to study how to preserve myriad types of information, such as scientific, financial, healthcare, artistic, and cultural data, for tens and even hundreds of years.
Preserving information is more than just storing bits of data. It involves preserving the understandability and usability of complex interrelated objects even when the technologies for computer hardware, operating systems, data management products, and applications are replaced with newer ones, and even as the data consumers (designated communities) change. This imposes new requirements for ensuring long-term access and understandability, while enabling new interpretations of the same data.
At the heart of any solution to the preservation problem resides a storage component, which is the permanent location of the information. Traditional archival storage considers only bit preservation, if it considers preservation issues at all. We argue that in order to better preserve data and understandability for long periods, a new type of storage must emerge to take preservation considerations into account.
Preservation DataStores (PDS) is such a storage component. It supports digital preservation environments by ensuring data usability and integrity over long periods of time, and it provides new functions and extensions that are specific to logical preservation. PDS encapsulates the raw data with its complex interrelated metadata objects, so that they remain inseparable during migration processes and when the data is accessed in the future. PDS reduces data transfer between the applications and the storage by offloading data-intensive functions, such as fixity computations and transformations, to the storage. PDS also simplifies the applications by transferring responsibility for managing storage-related events, such as provenance events, to the storage itself.
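To make the idea of offloading concrete, the following is a minimal sketch in Python, not the PDS interface. It assumes a hypothetical in-memory store (`ToyPreservationStore`) that keeps each object's data and metadata together, computes a SHA-256 fixity value inside the store rather than in the application, and records provenance events itself. All class, method, and field names here are illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PreservationObject:
    """Hypothetical container that keeps raw data and its metadata together."""
    data: bytes
    metadata: dict
    provenance: list = field(default_factory=list)


class ToyPreservationStore:
    """Illustrative stand-in for a preservation-aware store (not the PDS API)."""

    def __init__(self):
        self._objects = {}

    def ingest(self, object_id, data, metadata):
        # The fixity value is computed by the store, so the application
        # never has to read the data back just to hash it.
        digest = hashlib.sha256(data).hexdigest()
        obj = PreservationObject(data, dict(metadata, fixity_sha256=digest))
        obj.provenance.append("ingest")
        self._objects[object_id] = obj
        return digest

    def check_fixity(self, object_id):
        # Recompute the digest in place and compare it with the stored value.
        obj = self._objects[object_id]
        current = hashlib.sha256(obj.data).hexdigest()
        ok = current == obj.metadata["fixity_sha256"]
        # The store, not the application, records the provenance event.
        obj.provenance.append("fixity-ok" if ok else "fixity-failed")
        return ok

    def provenance(self, object_id):
        # Return a copy so callers cannot alter the recorded history.
        return list(self._objects[object_id].provenance)
```

In this sketch the application only calls `ingest` and `check_fixity`; the bytes never leave the store for hashing, and the event history accumulates alongside the object.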
We have proposed some PDS concepts to standards organizations. For example, we have proposed that SNIA standardize a self-describing, self-contained data format (SD-SCDF). The objective of the SD-SCDF is to make it possible to move media transparently from storage system A to storage system B, while supporting long-term archiving (preservation). Here, "moving media" means physically removing the media from storage system A and placing it in storage system B. By "transparent" we mean that storage system A is not involved: all the information that storage system B needs to understand the media is self-described and self-contained within the media.
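The self-describing, self-contained property can be illustrated with a small Python sketch. It is not the SD-SCDF format itself. It assumes a directory stands in for the media, and that a hypothetical JSON manifest (`MANIFEST.json`, an invented name) stored on the media lists every object it holds. The reader plays the role of storage system B and uses only what is on the media.

```python
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = "MANIFEST.json"  # hypothetical name, not part of any standard


def write_media(media_root, objects):
    """Write payloads plus a manifest describing them onto the 'media'."""
    root = Path(media_root)
    entries = []
    for name, (payload, description) in objects.items():
        (root / name).write_bytes(payload)
        entries.append(
            {"file": name, "size": len(payload), "description": description}
        )
    manifest = {"layout_version": 1, "objects": entries}
    (root / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2),
                                      encoding="utf-8")


def read_media(media_root):
    """Recover every object using only what is stored on the media itself."""
    root = Path(media_root)
    manifest = json.loads((root / MANIFEST_NAME).read_text(encoding="utf-8"))
    return {
        entry["file"]: ((root / entry["file"]).read_bytes(),
                        entry["description"])
        for entry in manifest["objects"]
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as media:
        write_media(media, {"scan.bin": (b"\x00\x01", "two raw bytes")})
        # No state from the writing side is consulted here.
        print(read_media(media))
```

The reader receives nothing from the writer except the media, which is the sense of "transparent" used above.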
PDS will serve as an infrastructure component of CASPAR, a European Union project that is building a framework to support the end-to-end preservation lifecycle for scientific, artistic, and cultural information.