Long-term preservation of digital objects is a challenging problem, and is especially so for newspaper digitization projects. Here are some of our recommendations to guide you through your project.
Many newspaper projects are very large, with huge numbers of objects to store and preserve. The digital objects themselves also tend to be large, and it’s not uncommon for a scanned image of one newspaper page to be 50 MB or more. Large newspaper digitization projects often end up with tens or even hundreds of terabytes of digital objects to store and preserve.
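As a rough illustration of how these figures add up, the sketch below multiplies a page count by an average page size. The one-million-page collection and the three-copy scenario are hypothetical values chosen for the example, and the function name is our own; only the 50 MB page size comes from the text above.

```python
def estimate_storage_tb(pages: int, mb_per_page: float, copies: int = 1) -> float:
    """Return total storage in terabytes (decimal units: 1 TB = 1,000,000 MB)."""
    return pages * mb_per_page * copies / 1_000_000


# Hypothetical collection: one million pages at 50 MB per page.
single_copy = estimate_storage_tb(1_000_000, 50)
three_copies = estimate_storage_tb(1_000_000, 50, copies=3)

print(f"{single_copy:.0f} TB for one copy, {three_copies:.0f} TB for three copies")
# prints "50 TB for one copy, 150 TB for three copies"
```

Even a single copy of a collection this size reaches tens of terabytes, and every additional preservation copy multiplies the total.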
In addition to the logistical challenges of securely storing huge amounts of data, there are challenges related to technological change, data formats becoming obsolete, and more.
Our Veridian-based services have traditionally focused on digitization, discovery, and delivery, and we don’t pretend to offer a complete solution for ensuring digital objects remain preserved and usable for decades or centuries. We do, however, offer services that make this easier: we can back up and preserve large amounts of data and keep it safe in the medium term (i.e. years, as opposed to centuries!). We can also make recommendations based on our experience with other projects.
Options and recommendations
-
Create digital objects in a standardized data format, preferably METS/ALTO. The long-term benefit of adopting the same standards used by other projects is that if a standard ever becomes obsolete, you won’t be the only project needing to solve the problem. Hundreds of projects have digitized hundreds of millions of newspaper pages as METS/ALTO objects, so if the format ever becomes obsolete, a suitable migration path is almost certain to be developed.
-
The industry standard, as recommended by the Library of Congress for the National Digital Newspaper Program (NDNP), is still to archive uncompressed TIFF master images of each newspaper page. These images are very large (often 50-100 MB), so large collections require a huge amount of storage. Some projects choose not to archive these very large images and instead store JPEG 2000 images. The JPEG 2000 images have the same resolution and quality as the TIFFs but use lossless compression, resulting in much smaller files. The decision whether to retain or discard the uncompressed TIFFs depends on the practices of the institution digitizing the newspapers, on the budget and infrastructure it has available for long-term preservation, and on how comfortable it is with departing from the accepted “best practice”.
-
While storage space is constantly becoming less expensive, it is still relatively difficult and expensive to securely store tens of terabytes of data. Costs depend, of course, on the quality of the “preservation” on offer. For example, simply storing all the data on commodity hard drives costs relatively little, but hard drives and other media do eventually degrade and fail. A simple LOCKSS (Lots Of Copies Keep Stuff Safe) approach is much better, but is usually more complex and costly. And there are, of course, many other options to consider.
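To make the "lots of copies" idea more concrete, here is a minimal sketch of a fixity check: it computes a SHA-256 checksum for each copy of a file and flags any copy that disagrees with the majority. This is not the LOCKSS software itself, only a simplified illustration of why keeping several copies helps detect degraded media. The function names, file names, and three-site layout are all hypothetical.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspect_copies(copies: list[Path]) -> list[Path]:
    """Return the copies whose checksum differs from the most common one."""
    checksums = {path: sha256_of(path) for path in copies}
    majority, _ = Counter(checksums.values()).most_common(1)[0]
    return [path for path, value in checksums.items() if value != majority]


# Demo with three hypothetical copies of one page image, one of them damaged.
with tempfile.TemporaryDirectory() as tmp:
    copies = []
    for site in ("site_a", "site_b", "site_c"):
        path = Path(tmp) / f"{site}_page_0001.tif"
        path.write_bytes(b"example page image bytes")
        copies.append(path)

    # Simulate media degradation by altering the second copy.
    copies[1].write_bytes(b"example page image bytez")

    suspects = find_suspect_copies(copies)
    print([p.name for p in suspects])
    # prints "['site_b_page_0001.tif']"
```

With only one copy, a damaged file can be detected only if a checksum was recorded earlier; with several copies, the intact ones can also be used to repair the damaged one.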
We have experience collaborating with specialist preservation services such as MetaArchive and Amazon Web Services Glacier, allowing our team to offer practical guidance to organizations exploring preservation approaches within a broader, standards-based framework.