Veridian Blog - Resources on Digitizing Archives & Historic Newspapers

Planning a Newspaper Digitisation RFP: What to Include (and What Most Libraries Miss)

Written by Admin | Apr 21, 2026 10:27:06 PM

What should you include in a newspaper digitisation RFP? This guide covers key requirements, technical standards, and common pitfalls to help you plan with confidence.

Digitising historical newspapers is a major investment—one that shapes how your collection is preserved, accessed, and discovered for years to come.

For many libraries and archives, the process starts with an RFP (Request for Proposal) or RFB (Request for Bids). Defining the right requirements, however, isn’t always straightforward.

Recently, a university library reached out to peers looking for examples and guidance for issuing a multi-year digitisation contract—aiming to streamline procurement and lock in pricing across multiple projects. It’s a familiar challenge, and one that highlights an important reality:

A digitisation RFP is only as strong as the technical detail behind it.

Without clear specifications, even well-intentioned projects can run into issues with quality, consistency, and long-term usability.

Why this matters in the long term

The decisions made at the RFP stage directly impact:

  • How easily users can search and explore content.

  • How well the collection integrates with platforms.

  • The ability to scale and expand over time.

  • The overall return on your digitisation investment.

What your RFP should clearly define

Below are the core areas every newspaper digitisation RFP should cover, based on real-world requirements used in active projects.

1. Metadata and standards

At the heart of any digitised newspaper collection is structured metadata.

Most large-scale projects align with standards defined by the Library of Congress, particularly:

  • METS (Metadata Encoding and Transmission Standard) for document-level structure.

  • ALTO (Analyzed Layout and Text Object) for page-level OCR and layout.

Your RFP should clearly state:

  • Which standards must be followed.

  • How they should be applied.

  • Any validation requirements.

This ensures consistency and interoperability with downstream systems.
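As a rough illustration of what a first-pass validation requirement could look like, the sketch below checks that a delivered file is well-formed XML and that its root element is the one you expect for METS or ALTO. It is only a sanity check under stated assumptions: the function names and the `EXPECTED_ROOTS` table are invented for this example, and a real project would normally also validate each file against the published schema version named in the RFP.

```python
import xml.etree.ElementTree as ET

# Assumed rule for illustration: the root element's local name must match the file type.
EXPECTED_ROOTS = {"mets": "mets", "alto": "alto"}


def root_local_name(xml_text: str) -> str:
    """Parse XML text and return the root element's name without its namespace."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if the text is not well-formed
    return root.tag.rsplit("}", 1)[-1]


def check_root(xml_text: str, kind: str) -> list[str]:
    """Return a list of problems found; an empty list means this basic check passed."""
    try:
        name = root_local_name(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if name.lower() != EXPECTED_ROOTS[kind]:
        return [f"expected <{EXPECTED_ROOTS[kind]}> root, found <{name}>"]
    return []
```

A check like this catches truncated or mislabelled files early, but it says nothing about whether the content inside the file is correct, which is why schema validation and spot checks are still worth specifying.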

Related reading: Archival Metadata Standards Guide

2. Segmentation (often overlooked)

Newspaper digitisation isn’t just about capturing images and text—it’s about preserving how content is structured and read.

Newspapers don’t follow a simple, linear reading flow like a book. Instead, content is arranged across multiple columns, with articles that may continue across sections or pages. Headlines, advertisements, illustrations, and editorial content all sit alongside one another within the same layout.

If the structure isn’t captured correctly during digitisation, the content can feel disjointed—making it difficult for users to follow articles or understand how different elements relate to each other.

You’ll need to define whether your project requires:

Page-level segmentation

  • Content is structured at the page level.

  • Text blocks follow a defined reading order (e.g. column-by-column, top-to-bottom).

  • The page remains the smallest unit of structure.

  • Search results return pages, not individual articles.

Article-level segmentation

  • Individual articles are identified within each page.

  • The article becomes the smallest unit of structure, rather than the page.

  • Search results return articles, making them more precise.

  • Articles can be categorised (e.g. advertisements, illustrations, family notices), enabling more refined search options.

In many projects, the level of segmentation is ultimately driven by budget—but if it isn’t clearly defined upfront, it can lead to inconsistent outputs and limitations in how your collection can be used.
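To make "a defined reading order" concrete, here is a minimal sketch of column-by-column, top-to-bottom ordering for page-level segmentation. The `TextBlock` class, the pixel coordinates, and the fixed `column_width` are all assumptions made for illustration. Real newspaper pages have headlines that span columns and irregular layouts, so this is a simplification rather than how any particular vendor does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    block_id: str
    x: int  # left edge of the block, in pixels
    y: int  # top edge of the block, in pixels


def reading_order(blocks: list[TextBlock], column_width: int) -> list[str]:
    """Order blocks column by column (left to right), then top to bottom within a column."""
    # Blocks whose left edge falls in the same column_width band share a column.
    ordered = sorted(blocks, key=lambda b: (b.x // column_width, b.y))
    return [b.block_id for b in ordered]
```

Writing the rule down this explicitly in an RFP, even in plain words, gives you something to test sample pages against.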

3. File outputs and deliverables

Your RFP should remove any ambiguity about what gets delivered. It will typically list the files required for each issue and for each page, as in the examples below.

Per issue:

  • One METS XML file (document-level metadata).

  • One issue-level PDF.

Per page:

  • One ALTO XML file (page-level metadata).

  • One JPEG2000 image.

  • One page-level PDF.

Clear definitions here help avoid inconsistencies across batches and vendors.
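A deliverables list like this is easy to check automatically. The sketch below counts the files received for one issue and reports anything that does not match the expected totals. It assumes the naming pattern used in the example in section 5 (`_METS.xml`, `_issue.pdf`, `_ALTO_`), and accepts either `.jp2` or `.jpg` for page images. The function name and those suffix rules are illustrative, so adjust them to your own conventions.

```python
from collections import Counter


def check_issue_deliverables(filenames: list[str], page_count: int) -> list[str]:
    """Compare one issue's file names against the expected per-issue and per-page counts."""
    counts = Counter()
    for name in filenames:
        lower = name.lower()
        if lower.endswith("_mets.xml"):
            counts["mets"] += 1
        elif "_alto_" in lower and lower.endswith(".xml"):
            counts["alto"] += 1
        elif lower.endswith("_issue.pdf"):
            counts["issue_pdf"] += 1
        elif lower.endswith(".pdf"):
            counts["page_pdf"] += 1
        elif lower.endswith((".jp2", ".jpg")):
            counts["image"] += 1
    expected = {
        "mets": 1,
        "issue_pdf": 1,
        "alto": page_count,
        "image": page_count,
        "page_pdf": page_count,
    }
    # Report only the file types whose count differs from what was expected.
    return [
        f"{kind}: expected {want}, found {counts[kind]}"
        for kind, want in expected.items()
        if counts[kind] != want
    ]
```

An empty result means the counts line up. It does not prove the files themselves are valid, only that nothing is obviously missing or duplicated.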

4. Image and PDF specifications

Technical specifications matter more than they might seem—they affect performance, storage, and usability.

For example:

  • JPEG2000 images

    • Defined compression ratios and quality layers.

    • ICC colour profiles included.

    • Consistent tile sizes to support efficient image delivery.

  • PDFs

    • Searchable OCR text embedded behind images.

    • Optimised resolution (DPI) for usability and file size.

    • No unnecessary elements, such as bookmarks, links, annotations, scripts, and embedded thumbnails.

5. Batch structure and consistency

Digitisation at scale requires a consistent directory and naming structure.

While file-naming conventions vary by project, a commonly used directory structure looks like this:

<batch-name>/<publication-code>/<year>/<month>/<day>/<files>

Using this structure, a three-page issue of The Daily Post from 25 April 1880, uploaded on 30 March 2026, might be organised as follows:

batch-20260330/
└── TDP
    └── 1880
        └── 04
            └── 25
                ├── TDP_18800425_METS.xml
                ├── TDP_18800425_issue.pdf
                ├── ALTO/
                │   ├── TDP_18800425_ALTO_0001.xml
                │   ├── TDP_18800425_ALTO_0002.xml
                │   └── TDP_18800425_ALTO_0003.xml
                ├── MASTER/
                │   ├── TDP_18800425_0001.jpg
                │   ├── TDP_18800425_0002.jpg
                │   └── TDP_18800425_0003.jpg
                └── PAGEPDF/
                    ├── TDP_18800425_0001.pdf
                    ├── TDP_18800425_0002.pdf
                    └── TDP_18800425_0003.pdf
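One benefit of a fixed directory pattern is that it can be generated and checked by a script. The sketch below builds the `<batch-name>/<publication-code>/<year>/<month>/<day>/` prefix for an issue and parses it back out of a delivered path. The function names and the regular expression are assumptions for this example. It covers only the directory levels, not the file names, since those vary by project.

```python
import re
from datetime import date
from typing import Optional

# Matches <batch-name>/<publication-code>/<year>/<month>/<day>/ at the start of a path.
ISSUE_DIR = re.compile(
    r"^(?P<batch>[^/]+)/(?P<pub>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
)


def issue_dir(batch: str, pub_code: str, issue_date: date) -> str:
    """Build the directory prefix for one issue, with zero-padded month and day."""
    return (
        f"{batch}/{pub_code}/"
        f"{issue_date.year:04d}/{issue_date.month:02d}/{issue_date.day:02d}/"
    )


def parse_issue_dir(path: str) -> Optional[tuple[str, str, date]]:
    """Return (batch, publication code, issue date), or None if the path does not conform."""
    match = ISSUE_DIR.match(path)
    if match is None:
        return None
    try:
        issue_date = date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:  # e.g. 30 February
        return None
    return match["batch"], match["pub"], issue_date
```

For the issue described above, `issue_dir("batch-20260330", "TDP", date(1880, 4, 25))` gives `batch-20260330/TDP/1880/04/25/`. Paths with an unpadded month or an impossible date are rejected, which is the kind of inconsistency that tends to surface only at ingestion.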

6. Delivery and validation

Finally, your RFP should define how files are packaged and verified.

Common approaches include:

  • Packaging using BagIt.

  • Generating checksums (e.g. MD5) for all files.

  • Defining any validation requirements for metadata to avoid common issues, such as hidden ALTO errors.

This helps ensure files are complete and can be ingested reliably.
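To show what checksum-based verification involves, here is a minimal sketch that writes one `<md5>  <relative path>` line per file and later re-checks a delivery against those lines. The layout resembles a BagIt payload manifest, but this is not a BagIt implementation: a complete bag has additional required files that this sketch does not produce, and the function names are invented for the example.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's MD5 in chunks so large images are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(root: Path) -> list[str]:
    """Return one '<md5>  <relative path>' line per file under root, sorted by path."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return [f"{md5_of(p)}  {p.relative_to(root).as_posix()}" for p in files]


def verify(root: Path, lines: list[str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum no longer matches."""
    failures = []
    for line in lines:
        expected, rel = line.split(maxsplit=1)
        target = root / rel
        if not target.is_file() or md5_of(target) != expected:
            failures.append(rel)
    return failures
```

The vendor would generate the manifest before shipping and the library would run the verification step on receipt. An empty list from `verify` means every listed file arrived intact.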

What most digitisation RFPs miss

Even well-structured RFPs often leave gaps that create challenges later.

Some of the most common include:

  • Unclear segmentation requirements = inconsistent article extraction.

  • No validation criteria for METS/ALTO = unusable or inconsistent metadata.

  • Lack of defined batch structure = ingestion and scaling difficulties.

  • No consideration of end-use (search, discovery, user experience) = poor alignment with user needs.

  • Overly generic deliverables = too much interpretation left to vendors.

Addressing these upfront can save significant time and cost down the line.

Planning your next digitisation project?