Case Study: Inside Papers Past’s Data Management Processes

Introduction

Papers Past is a website delivering content from the digitisation programme of the National Library of New Zealand (NLNZ), currently containing over 6.7 million 19th and 20th century New Zealand and Pacific newspapers. The Veridian team has been working with Papers Past since 2006, ingesting data and customizing the Veridian digital collection software to help create Papers Past on the web. However, this article isn't about Veridian as such; it focuses on Papers Past’s data management processes. We talked to Tracy Powell from the National Library about how they manage their newspaper data.

Papers Past contains a range of content, including magazines and journals, parliamentary papers, and letters and diaries. However, for this article we'll be concentrating on how the NLNZ digitisation team manage their newspaper data.

Digitisation process overview

Newspapers are Papers Past's largest component and most mature workflow.

The NLNZ digitisation team maintain an annual newspaper digitisation programme. For example, for the year ending June 2022 they might plan to digitise four batches of roughly 100,000 pages each, with the different steps of the batch process planned in the schedule, evenly staggered over the allocated months. The digitisation steps for each batch are (generally):

Prepare task order.
Dispatch microfilm.
Copy and store data.
Do acceptance testing.
Organize release to production.

Each step is managed by a digitisation advisor, who also carries out the acceptance testing and prepares the release for production.

The occasional newspaper is digitised directly from paper, but the majority are digitised from microfilm. Intermediate copies of the microfilm are made and sent off to a vendor who then captures the page images. From those page images the vendor captures the text, produces output in a variety of file formats such as TIFFs and XML, and then sends the data back.

The team then carries out a range of automated and manual acceptance tests on the received data, which usually results in a handful of files needing rework. When that is all done, the completed content is delivered through Papers Past.

The data that is received from the vendor consists of:

two TIFF images per newspaper page
- one PM (preservation master)
- one MM (modified master)
METS and ALTO XML
page-level and issue-level PDFs

Preservation masters are unmodified, uncompressed page scans. Modified masters are deskewed, dimensionally smaller, compressed versions of the preservation masters, and are used as a source for OCR and for display via the web-based presentation software. METS and ALTO are the industry-standard XML formats for representing newspaper issues in digital form. Page-level and issue-level PDFs are created and made available to make it easy for end users to print selected newspaper material.

When a new batch of data is loaded into the Papers Past data store, it isn't added as a self-contained batch of data. Instead, it goes into an existing directory structure which is arranged by newspaper title. This means that the directory structure itself helps to maintain data uniformity, including helping to prevent duplicate copies of the same newspaper issue from appearing in the data.
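To illustrate how a title-based directory structure can guard against duplicates, here is a minimal sketch. The actual layout of the Papers Past data store is not described beyond being arranged by newspaper title, so the year and date levels, the title code "ABC" and the function names are all assumptions made for illustration.

```python
from datetime import date
from pathlib import Path


def issue_directory(store_root, title_code, issue_date):
    """Build a hypothetical issue path: <root>/<title>/<year>/<YYYYMMDD>."""
    stamp = f"{issue_date.year:04d}{issue_date.month:02d}{issue_date.day:02d}"
    return Path(store_root) / title_code / f"{issue_date.year:04d}" / stamp


def add_issue(store_root, title_code, issue_date):
    """Create the issue directory, refusing to overwrite an existing issue."""
    target = issue_directory(store_root, title_code, issue_date)
    # exist_ok=False means a second copy of the same issue raises
    # FileExistsError instead of silently sitting alongside the first.
    target.mkdir(parents=True, exist_ok=False)
    return target
```

Because every batch writes into the same tree, a second copy of an issue has nowhere to go except the path that is already occupied, so the clash surfaces at load time rather than during ingestion.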

When the decision is made to digitise a particular batch of newspapers on microfilm, a manual check is performed to ensure there are no overlapping newspaper issues on that microfilm, compared with newspaper issues already digitised and available on Papers Past. This check is performed to avoid the potential for duplicate newspaper issues to appear in separate source data batches. In general, a single copy (i.e. the best available version) of the source METS/ALTO data of each newspaper issue is maintained in Papers Past at any one time. If there were two copies of the same newspaper issue in the METS/ALTO data, this would be considered a clashing newspaper issue which would need to be resolved, because at most one version of each newspaper issue must be ingested into the production Papers Past installation.

In other projects we notice these clashing newspaper issues during data ingestion, but interestingly we don't think this has ever happened with Papers Past METS/ALTO data because of these two actions:

A manual check for overlapping newspaper issues is done before new Papers Past source data is digitised.
When new data batches are added to the Papers Past data store, they aren't added as separate batches; they are loaded into a common directory structure arranged by title.

If duplicates are discovered between batches of digitised newspapers, this data is compared to decide between the separate digitised versions of the same newspaper issues. Typically, the most recently processed newspaper issues are kept and the previously processed issues are removed from the data store, as the most recently processed issues are likely to be better quality than the older data.
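The "keep the most recently processed copy" rule can be sketched as below. This is only an illustration of the typical outcome described above. In practice the team compares the data before deciding, and the record fields and function name here are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IssueCopy:
    title_code: str     # hypothetical short code for the newspaper title
    issue_date: date    # publication date of the issue
    batch: str          # hypothetical label of the batch the copy came from
    processed_on: date  # when this copy was processed


def resolve_clashes(copies):
    """Return (kept, removed), keeping one copy per title and issue date."""
    by_issue = defaultdict(list)
    for copy in copies:
        by_issue[(copy.title_code, copy.issue_date)].append(copy)

    kept, removed = [], []
    for group in by_issue.values():
        # Most recently processed copy first; the rest are candidates for removal.
        ordered = sorted(group, key=lambda c: c.processed_on, reverse=True)
        kept.append(ordered[0])
        removed.extend(ordered[1:])
    return kept, removed
```

The `removed` list is best treated as a review list rather than an automatic deletion list, since the newer copy is only likely, not guaranteed, to be the better one.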

Data acceptance testing

The team uses a variety of automated and manual acceptance testing tools. They are currently working on a project to implement an off-the-shelf testing product which will change their workflow to some extent, but for now their testing includes:

Automated acceptance testing

For this testing, data integrity batch checksums are checked. When a batch of data is received, checksums made from that data are reviewed to ensure the data has been transferred in its entirety and without modification. Any data that fails checksums will get replaced.
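A checksum comparison of this kind might look like the following sketch. The article does not say which hash algorithm or manifest format is used, so SHA-256 and a simple mapping of relative path to expected digest are assumptions, as are the function names.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large TIFFs need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_checksums(batch_dir, manifest, algorithm="sha256"):
    """Return relative paths that are missing or whose digest differs.

    `manifest` maps relative file paths to expected hex digests
    (an assumed format, for illustration only).
    """
    failures = []
    for relative_path, expected in manifest.items():
        path = Path(batch_dir) / relative_path
        if not path.is_file() or file_digest(path, algorithm) != expected.lower():
            failures.append(relative_path)
    return sorted(failures)
```

Any path returned by `failed_checksums` corresponds to data that would need to be replaced.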

Data is loaded into DiRT (Digitisation Review Tool), the team's current tool for automated testing. DiRT is a bespoke system created by the National Library. For a batch of data to be loaded into DiRT, the data has to follow a set of standard directory and file-naming requirements. If any of the data deviates from this standard it won't load into DiRT and will be replaced.

When the data is in DiRT, a range of automated tests check details such as whether all of the files are well formed and valid, and whether the right number of files is present.

For example, if a newspaper has four pages, then there should be:

four PM (preservation master) TIFFs
four MM (modified master) TIFFs
four ALTO XML files
four page-level PDFs
one issue-level PDF
one METS XML file
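The counting rule in this example can be written down directly. The sketch below is not DiRT's implementation, which is not described in detail; the kind labels such as "pm_tiff" and the function names are hypothetical.

```python
from collections import Counter


def expected_counts(page_count):
    """Expected number of files of each kind for one newspaper issue."""
    return {
        "pm_tiff": page_count,   # preservation master, one per page
        "mm_tiff": page_count,   # modified master, one per page
        "alto_xml": page_count,  # one ALTO file per page
        "page_pdf": page_count,  # one PDF per page
        "issue_pdf": 1,          # one PDF for the whole issue
        "mets_xml": 1,           # one METS file for the whole issue
    }


def count_mismatches(page_count, file_kinds):
    """Compare the kinds of files found for an issue with the expected counts.

    Returns {kind: (expected, found)} for each kind whose count differs.
    """
    found = Counter(file_kinds)
    expected = expected_counts(page_count)
    return {
        kind: (expected.get(kind, 0), found.get(kind, 0))
        for kind in sorted(set(expected) | set(found))
        if expected.get(kind, 0) != found.get(kind, 0)
    }
```

For a four-page issue this expects 18 files in total, and an empty result from `count_mismatches` means the issue has the right number of each kind.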

Given the current level of maturity of their digitisation process, there may be only a handful of failed files per batch of data to be reprocessed.

The last step of automated testing anticipates that most of the newspaper data (aside from the PDFs) will eventually be loaded into New Zealand's National Digital Heritage Archive (NDHA). This archive is separate from Papers Past and is the digital preservation repository of the National Library of New Zealand, used to preserve digitised and born-digital material for posterity. As part of the automated testing, the preservation team run a validation process over the data to check that it meets their requirements for eventual ingestion.

Manual acceptance testing

Manual testing is performed on very small samples of data, so is not statistically significant, but it has proved useful in the past in revealing errors.

Once the automated testing is finished, the content is ingested into a QA (Quality Assurance) Papers Past installation, separate from the production installation, to facilitate some of the manual testing listed below. Very occasionally this ingestion reveals a problem that the team has not yet seen, because the data must be well formed in order to ingest.

When a batch of METS/ALTO data is ingested into Papers Past, logs are generated and then manually checked. If any newspaper issue fails to ingest, these logs provide further information, and the affected issues are reprocessed. As a result, every batch of data ultimately ingests into Papers Past: anything that fails is reprocessed until it ingests successfully.

The following additional manual tests are performed:

Preservation master TIFFs are checked for image quality, to ensure that the images reproduce the full content of the microfilm frame and meet quality requirements.
A calendar check is performed. When the data is loaded into DiRT it is displayed in a calendar view, similar to Papers Past's calendar view of newspaper issues. If a newspaper was published every Tuesday and Thursday, the calendar is checked month by month, looking for any gaps where there isn't an issue on those days. If a newspaper issue appears on a Wednesday, it is reviewed to make sure it is actually a Wednesday issue and hasn't been processed under the wrong date. This check is useful: it has uncovered cases where issues had been processed under an incorrect date, or were on the microfilm but had not been processed at all.
Metadata corrected by the digitisation vendor is tested for accuracy in the QA Papers Past installation:
- Article headline corrections are tested for accuracy.
- Illustration caption corrections are tested for accuracy.
- Any other metadata (occasionally for other formats: authors or language) that is manually corrected is investigated.
One instance of article headline manual testing revealed a large number of inaccurate headlines, which led to the discovery of a bug that had overwritten the corrected headlines with the original uncorrected text. This example shows that even though these manual testing samples are not large, they are useful. Testing of the article headline corrections is performed in Papers Past because within the collection it is easy to click on a headline to visit each tested article to establish its accuracy.
Special instructions: If the vendor has been asked to do something particular during processing, for example, to deal with a newspaper issue supplement in a specific way, this is scrutinized to ensure it has been completed.
The last manual check is of the reports provided by the vendor alongside the data, which may include:
- Lists of vendor image QA checks, including which newspaper issue images the vendor has looked at.
- The vendor check of headline corrections.
- Page-level OCR text accuracy: the vendor does an automated test to capture a page-level accuracy for each page, which is based on a small ground truth sample.
These vendor reports are checked for anything out of the ordinary. For example, if a newspaper title has an overall page-level accuracy of 56% rather than the normally expected range of 80% to 90%, it would be examined more closely to find out whether it really is particularly inaccurate or whether there has been a mistake in the accuracy calculation.
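The calendar check in the list above is done by eye in DiRT's calendar view, but the same logic can be sketched in code for readers who want to automate something similar. The function name and parameters are hypothetical, and weekdays follow Python's date.weekday() numbering, where Monday is 0.

```python
from datetime import date, timedelta


def calendar_check(issue_dates, publication_weekdays, start, end):
    """Compare issue dates against an expected weekly publication pattern.

    Returns (missing, unexpected): expected publication days with no issue,
    and issues dated on days the title was not normally published.
    """
    present = set(issue_dates)
    missing, day = [], start
    while day <= end:
        if day.weekday() in publication_weekdays and day not in present:
            missing.append(day)
        day += timedelta(days=1)
    unexpected = sorted(
        d for d in present
        if start <= d <= end and d.weekday() not in publication_weekdays
    )
    return missing, unexpected
```

For a Tuesday and Thursday title, publication_weekdays would be {1, 3}. Dates in either list are prompts for a human to look at the microfilm or the issue itself, not automatic failures, since a paper may genuinely have skipped a day or printed a special edition.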

As a typical example, for a batch of 120,000 pages, once all of the testing is done, all batch files will have passed each test, except for something like five files from five newspaper issues. In this case, the vendor would be asked to replace those five newspaper issues.

Reprocessing data

There are two general cases where the team go back and reprocess content.

Every year a small batch of general re-work is done to address situations where a Papers Past user has noticed a significant error in one of the existing newspaper issues.

If it can be fixed by reprocessing the existing scans, and if it's something significant like missing text or an incorrect date for an issue, then it will be reprocessed.
If the error would require scanning or even microfilming again, then it won't be fixed.
If the error is minor such as a typo in a headline, then it also won't be fixed.

The error has to be significant because reprocessing and delivering even one issue takes considerable work from all the staff who play a role in the process. Numbers vary, but perhaps only five issues a year are fixed this way.

The team may also reprocess larger amounts of data if a significant error comes to light later on that was not picked up in acceptance testing. Recently some small titles from the early years were re-processed to try and establish changes or improvements in quality from updates in the processing software. As differences in quality weren't huge, it is unlikely that more titles will be re-processed in this way in the short to medium term.

Why is the Papers Past newspaper data management process high quality?

Veridian believe the NLNZ team has a good newspaper data management process for their programme, because the processes detailed above are designed to ensure the following outcomes:

A generally high level of consistency and uniformity of their METS/ALTO newspaper data is maintained because they have a thorough data acceptance testing process for new data.
After going through this process, if something is wrong, standard mitigation processes are applied, including requiring the digitisation vendor to re-process the data.
If problems are found with newspaper issue data (even years later), they may decide to go back to the digitisation vendor and re-process data to improve it.
All of their newspaper issues import into the production Papers Past installation successfully.
They have no duplicate newspaper issues.
They work to improve this process over time.