What files are created during a newspaper digitisation project? This article breaks down the full file set behind a digitised newspaper issue and how they’re typically organised.
Most modern newspaper digitisation projects produce a set of related files that together describe and represent a single newspaper issue. Collectively, these files are often referred to in digital-preservation workflows as a digital object.
For clarity, in this article we use the term file set to describe all of the files associated with one newspaper issue.
What files make up a digitised newspaper issue?
A typical file set produced by a newspaper digitisation project includes the following components.
METS XML (one per issue)
Each newspaper issue is described by a single METS (Metadata Encoding and Transmission Standard) XML file.
This file acts as the master record for the issue. It contains descriptive and structural metadata and typically includes links to all of the other files that make up the complete file set for that issue.
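As a rough illustration of how the METS file links to the rest of the file set, the sketch below reads the file section of a METS document and lists each referenced file. It assumes files are referenced through `FLocat` elements carrying an `xlink:href` attribute, which is the usual METS pattern. The function name, the `USE` values and the file IDs in the sample are hypothetical, and real projects may structure their METS differently.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical, heavily trimmed METS fragment for illustration only.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                      xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="ocr">
      <mets:file ID="ALTO0001">
        <mets:FLocat LOCTYPE="URL" xlink:href="NYT_19400701_ALTO_0001.xml"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="service">
      <mets:file ID="JP20001">
        <mets:FLocat LOCTYPE="URL" xlink:href="NYT_19400701_0001.jp2"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def list_linked_files(mets_xml: str) -> list[tuple[str, str]]:
    """Return (file group USE, href) pairs for files listed in the METS fileSec."""
    root = ET.fromstring(mets_xml)
    linked = []
    for group in root.iter(f"{{{METS_NS}}}fileGrp"):
        use = group.get("USE", "")
        # Only look at direct <file> children so nested groups are not counted twice.
        for file_el in group.findall(f"{{{METS_NS}}}file"):
            for flocat in file_el.findall(f"{{{METS_NS}}}FLocat"):
                href = flocat.get(f"{{{XLINK_NS}}}href")
                if href:
                    linked.append((use, href))
    return linked


for use, href in list_linked_files(SAMPLE_METS):
    print(use, href)
```

A listing like this can be compared against the files actually present on disk to spot broken links before ingestion.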
ALTO XML (one per page)
For each page of the newspaper issue, there is usually a corresponding ALTO (Analyzed Layout and Text Object) XML file.
These files contain:
- The OCR-generated text for the page, and
- Detailed layout information describing how text appears on the page.
ALTO files are critical for enabling full-text search, text highlighting, and advanced discovery features in digital newspaper platforms.
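To show how the text and layout information work together, the sketch below pulls each word and its bounding box out of an ALTO document, then looks up the boxes for a search term, which is roughly what a viewer needs in order to highlight hits. It relies on the ALTO `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The function names and sample content are hypothetical, and element names are matched without their namespace because the namespace URI differs between ALTO versions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1"><TextLine>
      <String CONTENT="EXTRA" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="60"/>
      <SP/>
      <String CONTENT="EDITION" HPOS="420" VPOS="200" WIDTH="380" HEIGHT="60"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_words(alto_xml: str) -> list[dict]:
    """Return one dict per ALTO String element: its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        if local_name(el.tag) == "String":
            words.append({
                "text": el.get("CONTENT", ""),
                "x": float(el.get("HPOS", 0)),
                "y": float(el.get("VPOS", 0)),
                "width": float(el.get("WIDTH", 0)),
                "height": float(el.get("HEIGHT", 0)),
            })
    return words


def find_hits(words: list[dict], term: str) -> list[dict]:
    """Return the words that match the search term, ignoring case."""
    return [w for w in words if w["text"].lower() == term.lower()]


words = alto_words(SAMPLE_ALTO)
print(" ".join(w["text"] for w in words))
print(find_hits(words, "edition"))
```

The coordinates are expressed in whatever measurement unit the ALTO file declares, so a viewer may need to scale them to match the pixel dimensions of the displayed image.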
Preservation master images (TIFF, one per page)
Most projects create an uncompressed TIFF image (typically 300 DPI) for each page of the newspaper issue.
These TIFF files:
- Are very large (often 40–80 MB per page),
- Serve as high-quality preservation masters, and
- Remain the industry standard for long-term digital preservation.
While many projects choose to archive these files, others opt to discard them after creating derivative images, usually retaining JPEG 2000 files instead.
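The size of an uncompressed TIFF follows from simple arithmetic: pixel width times pixel height times bytes per pixel. The sketch below estimates it for a hypothetical 13 x 21 inch page scanned at 300 DPI. The page dimensions and function name are assumptions for illustration, and the estimate ignores TIFF headers and metadata, so real files will be slightly larger.

```python
def uncompressed_tiff_bytes(width_in: float, height_in: float,
                            dpi: int = 300, bits_per_pixel: int = 8) -> int:
    """Estimate the raw pixel data size of an uncompressed scan, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical page size; real newspaper formats vary widely.
grey = uncompressed_tiff_bytes(13, 21, dpi=300, bits_per_pixel=8)
colour = uncompressed_tiff_bytes(13, 21, dpi=300, bits_per_pixel=24)

print(f"8-bit greyscale: {grey / 1_000_000:.1f} MB")   # 24.6 MB
print(f"24-bit colour:   {colour / 1_000_000:.1f} MB")  # 73.7 MB
```

Multiplying the per-page figure by the number of pages in a collection gives a first estimate of the storage needed to keep the preservation masters.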
Access images (JPEG 2000, one per page)
For online access and display, projects typically generate JPEG 2000 (JP2) image files derived from the original TIFFs.
These files are:
- Significantly smaller than TIFFs,
- Optimised for zooming and web delivery, and
- Ideally compliant with the National Digital Newspaper Program (NDNP) JPEG 2000 profile.
In most digital newspaper platforms, JPEG 2000 files are used in place of TIFFs for public access.
Issue-level PDF (optional)
Some projects also create a single, multi-page PDF representing the entire newspaper issue.
These PDFs often include:
- All pages of the issue, and
- An embedded text layer generated from OCR.
While optional, issue-level PDFs remain popular because they provide a familiar and convenient way for users to view, print or download a complete issue.
Page-level PDFs (optional)
In addition to issue-level PDFs, many projects generate individual PDF files for each page of the newspaper issue.
These files are also optional but are commonly produced because they make it easy for patrons to print or download individual pages without handling large image files.
File organisation and naming
All files associated with a single newspaper issue are typically stored together using a consistent directory and naming structure. While file-naming conventions vary by project, a commonly used directory structure looks like this:
<batch-name>/<publication-code>/<year>/<month>/<day>/<files>
Using this structure, a three-page issue of the New York Times from 1 July 1940 might be organised as follows:
BATCH1/NYT/1940/07/01/NYT_19400701_mets.xml
BATCH1/NYT/1940/07/01/NYT_19400701_issue.pdf
BATCH1/NYT/1940/07/01/NYT_19400701_ALTO_0001.xml
BATCH1/NYT/1940/07/01/NYT_19400701_ALTO_0002.xml
BATCH1/NYT/1940/07/01/NYT_19400701_ALTO_0003.xml
BATCH1/NYT/1940/07/01/NYT_19400701_0001.tif
BATCH1/NYT/1940/07/01/NYT_19400701_0002.tif
BATCH1/NYT/1940/07/01/NYT_19400701_0003.tif
BATCH1/NYT/1940/07/01/NYT_19400701_0001.jp2
BATCH1/NYT/1940/07/01/NYT_19400701_0002.jp2
BATCH1/NYT/1940/07/01/NYT_19400701_0003.jp2
BATCH1/NYT/1940/07/01/NYT_19400701_0001.pdf
BATCH1/NYT/1940/07/01/NYT_19400701_0002.pdf
BATCH1/NYT/1940/07/01/NYT_19400701_0003.pdf
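A convention like this is easy to express in code, which helps when checking that a delivered batch is complete. The sketch below builds the expected paths for one issue using the example naming pattern shown above and reports any that are missing from a delivery. The function names are hypothetical, and the pattern would need adjusting to match a given project's own conventions.

```python
from datetime import date
from pathlib import PurePosixPath


def expected_file_set(batch: str, pub_code: str, issue_date: date,
                      page_count: int, include_pdfs: bool = True) -> list[str]:
    """List the paths expected for one issue under the example naming convention."""
    issue_dir = PurePosixPath(batch, pub_code, f"{issue_date:%Y}",
                              f"{issue_date:%m}", f"{issue_date:%d}")
    stem = f"{pub_code}_{issue_date:%Y%m%d}"
    pages = [f"{n:04d}" for n in range(1, page_count + 1)]

    names = [f"{stem}_mets.xml"]                        # one METS per issue
    if include_pdfs:
        names.append(f"{stem}_issue.pdf")               # optional issue-level PDF
    names += [f"{stem}_ALTO_{p}.xml" for p in pages]    # one ALTO per page
    names += [f"{stem}_{p}.tif" for p in pages]         # preservation masters
    names += [f"{stem}_{p}.jp2" for p in pages]         # access images
    if include_pdfs:
        names += [f"{stem}_{p}.pdf" for p in pages]     # optional page-level PDFs
    return [str(issue_dir / name) for name in names]


def missing_files(expected: list[str], present: list[str]) -> list[str]:
    """Return the expected paths that are not among the paths actually present."""
    present_set = set(present)
    return [path for path in expected if path not in present_set]


paths = expected_file_set("BATCH1", "NYT", date(1940, 7, 1), page_count=3)
for path in paths:
    print(path)

# Simulate a delivery in which the last page-level PDF was not supplied.
print(missing_files(paths, paths[:-1]))
```

Running the same check across every issue directory in a batch gives a quick completeness report before the files are validated or ingested.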
Need help with a newspaper digitisation project?
If you’re planning a newspaper digitisation project—or reviewing files from an existing one—understanding how these file sets are structured is critical for long-term preservation, discovery, and access.
If you have questions about:
- Required file formats (METS, ALTO, TIFF, JPEG 2000, PDFs)
- NDNP-aligned workflows and standards
- File validation, organisation, or ingestion
- Preparing digitised newspapers for access platforms
Get in touch with our team — we’re happy to talk through your project and help you determine the best approach for your collection.