Papers Past contains a range of content, including magazines and journals, parliamentary papers, and letters and diaries. However, for this article we'll be concentrating on how the NLNZ digitisation team manage their newspaper data.
Papers Past is a website delivering content from the digitisation programme of the National Library of New Zealand (NLNZ), currently containing over 6.7 million pages of 19th and 20th century New Zealand and Pacific newspapers. The Veridian team has been working with Papers Past since 2006, ingesting data into and customising their use of the Veridian digital collection software to help create Papers Past on the web. However, this article isn't about Veridian as such; we're focusing on Papers Past's data management processes. We talked to Tracy Powell from the National Library about how they manage their newspaper data.
Newspapers are Papers Past's largest component, with its most mature workflow.
The NLNZ digitisation team maintain an annual newspaper digitisation programme. For example, for the year ending June 2022 they might plan to digitise four batches of roughly 100,000 pages each, with the different steps of the batch process planned in the schedule and evenly staggered over the allocated months. The digitisation steps for each batch are (generally): microfilm duplication, page-image capture, text capture, acceptance testing and rework, and delivery to Papers Past.
Each step is managed by a digitisation advisor, who also carries out the acceptance testing and prepares the release for production.
The occasional newspaper is digitised directly from paper, but the majority are digitised from microfilm. Intermediate copies of the microfilm are made and sent off to a vendor who then captures the page images. From those page images the vendor captures the text, produces output in a variety of file formats such as TIFFs and XML, and then sends the data back.
The team then carries out a range of automated and manual acceptance testing on the received data, which usually results in a small amount of rework. Once that is complete, the finished content is delivered through Papers Past.
The data received from the vendor consists of preservation masters, modified masters, METS/ALTO XML, and page-level and issue-level PDFs.
Preservation masters are unmodified, uncompressed page scans. Modified masters are deskewed, dimensionally smaller, compressed versions of the preservation masters, used both as the source for OCR and for display via the web-based presentation software. METS/ALTO XML is the industry-standard format for representing newspaper issues in digital form. Page-level and issue-level PDFs are created to make it easy for end users to print selected newspaper material.
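To make the relationship between the two image formats concrete, here is a minimal Python sketch of deriving a modified master from a preservation master using the Pillow imaging library. The scale factor, file names and LZW compression choice are illustrative assumptions; the article doesn't describe the vendor's actual deskew algorithm or compression settings.

```python
from PIL import Image  # Pillow imaging library

# Illustrative sketch only: the real deskew and compression parameters
# used by the vendor are not described in the article.
def derive_modified_master(preservation_path: str, modified_path: str,
                           scale: float = 0.5) -> None:
    with Image.open(preservation_path) as img:
        # Deskew would happen here in a real pipeline (omitted).
        smaller = img.resize(
            (int(img.width * scale), int(img.height * scale)),
            Image.Resampling.LANCZOS,
        )
        # LZW keeps the TIFF lossless but smaller than an uncompressed master.
        smaller.save(modified_path, format="TIFF", compression="tiff_lzw")

# Hypothetical file names for a single page
derive_modified_master("page_0001_pm.tif", "page_0001_mm.tif")
```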
When a new batch of data is loaded into the Papers Past data store, it isn't added as a self-contained batch. Instead, it goes into an existing directory structure arranged by newspaper title. The directory structure itself therefore helps maintain data uniformity, including preventing duplicate copies of the same newspaper issue from appearing in the data.
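As a rough illustration of how a title-based layout surfaces duplicates, here is a short Python sketch. The store/title/year/date layout and the "ODT" title code are hypothetical; the article doesn't spell out the actual directory scheme.

```python
from pathlib import Path

def issue_dir(store: Path, title_code: str, issue_date: str) -> Path:
    # Hypothetical layout: <store>/<title code>/<yyyy>/<yyyymmdd>
    return store / title_code / issue_date[:4] / issue_date

store = Path("/data/paperspast")
target = issue_dir(store, "ODT", "18981123")  # "ODT" is a made-up title code
if target.exists():
    # A second copy of the same issue maps to the same path, so the
    # clash becomes visible before anything is overwritten.
    raise FileExistsError(f"Issue already present: {target}")
target.mkdir(parents=True)
```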
When the decision is made to digitise a particular batch of newspapers on microfilm, a manual check is performed to ensure that no newspaper issues on that microfilm overlap with issues already digitised and available on Papers Past. This check avoids the potential for duplicate newspaper issues to appear in separate source data batches. In general, a single copy (i.e. the best available version) of the source METS/ALTO data for each newspaper issue is maintained in Papers Past at any one time. Two copies of the same newspaper issue in the METS/ALTO data would be considered a clashing newspaper issue and would need to be resolved, because at most one version of each newspaper issue can be ingested into the production Papers Past installation.
In other projects we notice these clashing newspaper issues during data ingestion, but interestingly we don't think this has ever happened with Papers Past METS/ALTO data, thanks to the two safeguards described above: the title-based directory structure and the manual overlap check.
If duplicates are discovered between batches of digitised newspapers, the separate digitised versions of the same newspaper issues are compared to decide which to keep. Typically, the most recently processed issues are kept and the previously processed issues are removed from the data store, as the most recently processed issues are likely to be better quality than the older data.
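The resolution rule amounts to "latest batch wins". A small Python sketch of that rule, with hypothetical batch dates and issue keys (the article doesn't describe the actual data structures involved):

```python
def resolve_clashes(issues: dict) -> dict:
    """issues maps an issue key to a list of (batch_date, path) versions."""
    kept = {}
    for key, versions in issues.items():
        # Later batch dates win, on the assumption that newer processing
        # software produces better images and OCR.
        kept[key] = max(versions, key=lambda version: version[0])
    return kept

clashes = {"ODT_18981123": [("2015-06", "batch07/ODT_18981123"),
                            ("2022-03", "batch41/ODT_18981123")]}
print(resolve_clashes(clashes))  # keeps the 2022 version
```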
The team uses a variety of automated and manual acceptance testing tools. They are currently working on a project to implement an off-the-shelf testing product, which will change their workflow to some extent, but for now their testing includes the following checks.
First, batch checksums are verified for data integrity. When a batch of data is received, the checksums supplied with it are checked to ensure the data has been transferred in its entirety and without modification. Any file that fails its checksum is replaced.
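A minimal sketch of that integrity check, assuming the vendor supplies a manifest of "<sha256>  <relative path>" lines; the actual manifest format and hash algorithm used for Papers Past aren't stated in the article.

```python
import hashlib
from pathlib import Path

def verify_batch(batch_dir: Path, manifest: Path) -> list:
    """Return the relative paths of files whose checksums don't match."""
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        digest = hashlib.sha256((batch_dir / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            failures.append(rel_path)  # these files would be requested again
    return failures

bad = verify_batch(Path("batch41"), Path("batch41/manifest.sha256"))
print(f"{len(bad)} file(s) failed their checksum and need replacement")
```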
Data is loaded into DiRT (Digitisation Review Tool), the team's current tool for automated testing. DiRT is a bespoke system created by the National Library. For a batch of data to be loaded into DiRT, the data has to follow a set of standard directory and file-naming requirements. If any of the data deviates from this standard it won't load into DiRT and will be replaced.
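DiRT's actual naming rules aren't published in the article, but the general idea can be sketched with a made-up pattern of TITLECODE_YYYYMMDD_PAGE.ext:

```python
import re

# Hypothetical convention: <TITLECODE>_<YYYYMMDD>_<PAGE>.<ext>
NAME_PATTERN = re.compile(r"^[A-Z]{2,6}_\d{8}_\d{4}\.(tif|xml|pdf)$")

def nonconforming(filenames: list) -> list:
    """Return the names that deviate from the standard."""
    return [name for name in filenames if not NAME_PATTERN.match(name)]

print(nonconforming(["ODT_18981123_0001.tif", "odt-1898-11-23-p1.tif"]))
# The second name deviates from the standard, so it wouldn't load into
# DiRT and would be sent back for replacement.
```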
When the data is in DiRT, a range of automated tests checks details such as whether all of the files are well formed and valid, and whether the expected number of files is present.
For example, if a newspaper issue has four pages, then there should be four preservation masters, four modified masters, four ALTO XML files and four page-level PDFs, plus one METS XML file and one issue-level PDF for the issue as a whole.
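A sketch of that completeness check in Python, combining the file counts above with a basic XML well-formedness test. The directory layout and file-name suffixes are assumptions, not DiRT's real conventions.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

def check_issue(issue_dir: Path, pages: int) -> list:
    """Return a list of problems found for one newspaper issue."""
    problems = []
    expected = {"*_pm.tif": pages, "*_mm.tif": pages,      # page images
                "*_alto.xml": pages, "*_page.pdf": pages,  # per-page text/PDF
                "*_mets.xml": 1, "*_issue.pdf": 1}         # issue-level files
    for pattern, count in expected.items():
        found = sorted(issue_dir.glob(pattern))
        if len(found) != count:
            problems.append(f"expected {count} x {pattern}, found {len(found)}")
        for xml_file in (f for f in found if f.suffix == ".xml"):
            try:
                ET.parse(xml_file)  # well-formedness only, not schema validity
            except ET.ParseError as err:
                problems.append(f"{xml_file.name}: {err}")
    return problems

print(check_issue(Path("batch41/ODT_18981123"), pages=4))
```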
Given the current level of maturity of their digitisation process, there may be only a handful of failed files per batch of data to be reprocessed.
The last step of automated testing relates to preservation: eventually most of the newspaper data (aside from the PDFs) will be loaded into New Zealand's National Digital Heritage Archive (NDHA). This archive is separate from Papers Past and is the digital preservation repository of the National Library of New Zealand, used to preserve digitised and born-digital material for posterity. As part of the automated testing, the preservation team run a validation process over the data to check that it meets their requirements for eventual ingestion.
Manual testing is performed on very small samples of data, so is not statistically significant, but it has proved useful in the past in revealing errors.
Once the automated testing is finished, the content is ingested into a QA (quality assurance) Papers Past installation, separate from the production installation, to help facilitate some of the manual testing listed below. Very occasionally this ingestion will reveal a problem with the data that the team hasn't yet seen, because the data must be well formed in order to ingest.
When a batch of METS/ALTO data is ingested into Papers Past, logs are generated and then manually checked. If any newspaper issue fails to ingest, these logs reveal further information, and any newspaper issues with problems picked up here also get reprocessed. In general, every batch of data ends up ingesting into Papers Past: anything that fails is reprocessed until it ingests successfully.
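The log check lends itself to a simple scan. This sketch assumes a hypothetical log format in which failures are flagged as "INGEST FAILED <issue id>: <reason>"; the real log format isn't shown in the article.

```python
import re
from pathlib import Path

FAILED = re.compile(r"INGEST FAILED (\S+): (.+)")  # hypothetical log format

def failed_issues(log_path: Path):
    """Yield (issue_id, reason) pairs for every failed ingest in the log."""
    for line in log_path.read_text().splitlines():
        match = FAILED.search(line)
        if match:
            yield match.group(1), match.group(2)

for issue_id, reason in failed_issues(Path("logs/batch41_ingest.log")):
    print(f"{issue_id} needs reprocessing: {reason}")
```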
Additional manual tests are also performed, including checks of corrected article headlines and a review of the vendor's accuracy reports.
One round of manual headline testing revealed a large number of inaccurate headlines, which led to the discovery of a bug that had overwritten the corrected headlines with the original uncorrected text. This example shows that even though the manual testing samples are not large, they are useful. Testing of the article headline corrections is performed in Papers Past itself, because within the collection it is easy to click on a headline, visit each tested article and establish its accuracy.
The vendor's reports are examined for anything out of the ordinary. For example, if a newspaper title has an overall page-level accuracy of 56% rather than the normally expected 80-90% range, it would be examined more closely to establish whether it really is particularly inaccurate or whether there has been a mistake in the accuracy calculation.
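That sanity check reduces to flagging titles whose reported accuracy falls outside the expected band. A sketch, with a made-up report structure:

```python
EXPECTED_LOW, EXPECTED_HIGH = 0.80, 0.90  # the normally expected range

def flag_outliers(report: dict) -> dict:
    """report maps a title to its overall page-level accuracy (0.0-1.0)."""
    return {title: accuracy for title, accuracy in report.items()
            if not EXPECTED_LOW <= accuracy <= EXPECTED_HIGH}

report = {"Otago Daily Times": 0.88, "Some Small Title": 0.56}
for title, accuracy in flag_outliers(report).items():
    print(f"{title}: {accuracy:.0%}, verify the content or check whether "
          f"the accuracy calculation itself went wrong")
```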
As a typical example, for a batch of 120,000 pages, once all of the testing is done, all batch files will have passed each test, except for something like five files from five newspaper issues. In this case, the vendor would be asked to replace those five newspaper issues.
There are two general cases where the team go back and reprocess content.
Every year a small batch of general rework is done to address situations where a Papers Past user has noticed a significant error in one of the existing newspaper issues.
The error has to be significant because of the considerable amount of work it takes to reprocess and deliver even one issue, given all the staff who play a role in the process. Numbers vary, but it might be only five issues a year that are fixed this way.
The team may also reprocess larger amounts of data if a significant error comes to light later that was not picked up in acceptance testing. Recently some small titles from the early years were reprocessed to establish whether updates to the processing software had changed or improved quality. As the differences in quality weren't large, it is unlikely that more titles will be reprocessed in this way in the short to medium term.
Veridian believe the NLNZ team has a good newspaper data management process for their programme, because the processes detailed above are designed to ensure that data arrives complete and uncorrupted, follows a uniform structure, contains no duplicate issues, and is well formed, valid and of consistent quality before it reaches production.