From Scanned Pages to Searchable Digital Collections

Data conversion is the digitisation stage that follows scanning. It transforms scanned images into structured, machine-readable formats that make historic collections searchable, discoverable, and usable by communities.

We specialise in managing the data conversion process for historic newspapers, while also supporting a wide range of other historic materials — including magazines, books, journals, and archival documents.

What we mean by data conversion

High-resolution images (typically uncompressed TIFFs) are converted into structured formats such as METS/ALTO, PDF, and JPEG2000. Optical character recognition (OCR) extracts text and structure, making collections:

  • Fully searchable

  • Easier to browse and navigate

  • More accessible to users online

Standards that support long-term access

Data quality directly affects how usable and sustainable a digital collection will be. Where appropriate, we encourage the use of METS and ALTO XML standards, maintained by the Library of Congress. These widely adopted standards support long-term preservation and interoperability across different content delivery platforms.

METS/ALTO allows collections to:

  • Store full-text content at page and word level
  • Capture physical structure (blocks, lines, word positions)
  • Preserve logical structure such as articles, headlines, and bylines
The result is richer, more reliable digital collections that communities can search, explore, and engage with over time.
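
As a rough illustration of what word-level text and position data looks like in practice, the sketch below reads a small ALTO fragment with Python's standard library. The fragment is hand-written for this example (the IDs, words, and coordinates are hypothetical), it assumes the ALTO v3 namespace, and `extract_words` is an illustrative helper, not part of any delivered toolset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment for illustration only.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3600">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String CONTENT="Harbour" HPOS="120" VPOS="200" WIDTH="310" HEIGHT="48"/>
            <SP/>
            <String CONTENT="opens" HPOS="450" VPOS="200" WIDTH="240" HEIGHT="48"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

# Assumption: the file uses the ALTO v3 namespace; other versions use a different URI.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}


def extract_words(alto_xml: str) -> list[dict]:
    """Return one record per word with its text, parent block/line, and position."""
    root = ET.fromstring(alto_xml)
    words = []
    for block in root.iterfind(".//alto:TextBlock", NS):
        for line in block.iterfind("alto:TextLine", NS):
            for string in line.iterfind("alto:String", NS):
                words.append({
                    "text": string.get("CONTENT"),
                    "block": block.get("ID"),
                    "line": line.get("ID"),
                    "hpos": float(string.get("HPOS")),
                    "vpos": float(string.get("VPOS")),
                })
    return words


words = extract_words(ALTO_SAMPLE)
page_text = " ".join(w["text"] for w in words)  # page-level full text for search
```

The same records can feed a search index (via `page_text`) and on-image hit highlighting (via the stored coordinates), which is why word-level capture matters for discovery.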

Data Conversion Options

We offer three data conversion options to suit different material types, structures, and access goals — from page-based publications to more complex, article-driven content.

Automated Page-Level METS/ALTO

This fully automated option produces page-level METS/ALTO, making content full-text searchable and standards-compliant. Being automated, it is the most cost-effective choice for large collections where maintaining the article-level structure is not required.

Estimated cost: $0.15 USD per page

Best suited for:

  • Page-based, text-heavy materials
  • Newspapers, magazines, journals, and reports
  • Collections with good-quality scans or microfilm

Page-Level METS/ALTO with Text Block Auditing

Text blocks are manually audited to reduce common OCR issues such as incorrect block ordering or mis-captured content, improving usability while maintaining a page-based structure.

Estimated cost: $0.28 USD per page

Best suited for:

  • Page-based materials with more complex layouts
  • Newspapers and magazines with varied columns or dense content
  • Books and journals where improved reading order is important

METS/ALTO with Article Segmentation & Headline Cleanup

This option enhances the logical structure of the content, improving how users navigate and understand complex, multi-article pages. It requires human review and is therefore more resource-intensive.

Estimated cost: $0.71 USD per page

Best suited for:

  • Newspapers and periodicals where content is organised into distinct articles
  • Magazines or journals where section-level navigation improves usability
  • Projects prioritising user experience and structured browsing

Cost of data conversion

Data conversion is typically priced per page and varies depending on the structure of the material, the level of automation required, and the conversion approach selected.

Page-level conversion options are generally more cost-effective, while enhanced structural processing — such as audited text blocks or article segmentation — involves additional human review and therefore higher per-page costs.
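
To make the per-page arithmetic concrete, here is a minimal sketch that compares the three estimated rates listed for the options above. The option keys and the 10,000-page collection size are hypothetical labels chosen for this example, and actual quotes may differ from these estimates.

```python
from decimal import Decimal

# Estimated per-page rates in USD, as listed for the three options.
# The dictionary keys are illustrative names, not official product codes.
RATES = {
    "automated_page_level": Decimal("0.15"),
    "text_block_auditing": Decimal("0.28"),
    "article_segmentation": Decimal("0.71"),
}


def estimate_cost(pages: int, option: str) -> Decimal:
    """Multiply the page count by the estimated per-page rate for one option."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return RATES[option] * pages


page_count = 10_000  # hypothetical collection size
estimates = {name: estimate_cost(page_count, name) for name in RATES}

for name, cost in estimates.items():
    print(f"{name}: ${cost:,.2f} USD")
```

Using `Decimal` rather than floating-point numbers keeps currency totals exact when they are summed across many batches.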

We’ll help you understand the trade-offs between cost, structure, and usability, and ensure the selected option aligns with your collection and access goals.

Quality assurance 

Quality assurance is a core part of our data conversion service. Before files are delivered, we review the outputs to identify issues that can affect usability, search accuracy, or long-term reliability.

Our quality checks focus on validating structure, consistency, and completeness — helping ensure the converted data meets agreed specifications and performs as expected in access and discovery environments.
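
The kind of structural and completeness check described here can be sketched in a few lines. This is an illustrative example only, not our production QA tooling: `check_alto_page`, the list of expected attributes, and the ALTO v3 namespace are all assumptions made for the sketch, and real checks are driven by the specification agreed for each project.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v3 namespace; adjust for other ALTO versions.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Attributes this sketch expects on every word; a real specification may differ.
EXPECTED_ATTRS = ("CONTENT", "HPOS", "VPOS", "WIDTH", "HEIGHT")


def check_alto_page(alto_xml: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        root = ET.fromstring(alto_xml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    strings = root.findall(".//alto:String", NS)
    if not strings:
        # A genuinely blank page can be legitimate, so this is flagged for review.
        problems.append("page contains no String elements")
    for index, string in enumerate(strings):
        missing = [attr for attr in EXPECTED_ATTRS if not string.get(attr)]
        if missing:
            problems.append(f"String #{index} missing {', '.join(missing)}")
    return problems


# Hypothetical fragment with one word that lacks its horizontal position.
BAD_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Harbour" VPOS="200" WIDTH="310" HEIGHT="48"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

report = check_alto_page(BAD_PAGE)
```

Returning a list of problems rather than stopping at the first one lets a reviewer see every issue on a page in a single pass.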

This final review helps safeguard your collection and reduces the risk of issues being discovered later in the delivery or preservation workflow.

Need help with data conversion for your collection?