Data Conversion and OCR
From Scanned Pages to Searchable Digital Collections
Data conversion is the digitisation stage that follows scanning. It transforms scanned images into structured, machine-readable formats that make historic collections searchable, discoverable, and usable by communities.
We specialise in managing the data conversion process for historic newspapers, while also supporting a wide range of other historic materials — including magazines, books, journals, and archival documents.
What we mean by data conversion
High-resolution images (typically uncompressed TIFFs) are converted into structured formats such as METS/ALTO, PDF, and JPEG2000. Optical character recognition (OCR) extracts text and structure, making collections:
- Fully searchable
- Easier to browse and navigate
- More accessible to users online
Standards that support long-term access
Data quality directly affects how usable and sustainable a digital collection will be. Where appropriate, we encourage the use of METS and ALTO XML standards, maintained by the Library of Congress. These widely adopted standards support long-term preservation and interoperability across different content delivery platforms.
METS/ALTO allows collections to:
- Store full-text content at page and word level
- Capture physical structure (blocks, lines, word positions)
- Preserve logical structure such as articles, headlines, and bylines
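
As an illustration of the word-level detail ALTO can hold, the sketch below reads a small hand-written ALTO fragment and lists each word with its block, line, and position. The sample XML, the `extract_words` helper, and the choice of the ALTO v4 namespace are assumptions made for this example; real files may use a different ALTO version and carry far more metadata.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the ALTO v4 namespace; v2/v3 files use a different URI.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical, hand-written sample page (not taken from a real collection).
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" WIDTH="2400" HEIGHT="3600">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String CONTENT="Harbour" HPOS="120" VPOS="200" WIDTH="310" HEIGHT="60"/>
            <SP/>
            <String CONTENT="News" HPOS="450" VPOS="200" WIDTH="190" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def extract_words(alto_xml):
    """Return one dict per word with its text, block/line IDs, and bounding box."""
    root = ET.fromstring(alto_xml.strip())
    words = []
    for block in root.iterfind(".//alto:TextBlock", ALTO_NS):
        for line in block.iterfind("alto:TextLine", ALTO_NS):
            for string in line.iterfind("alto:String", ALTO_NS):
                words.append({
                    "block": block.get("ID"),
                    "line": line.get("ID"),
                    "text": string.get("CONTENT"),
                    "hpos": float(string.get("HPOS")),
                    "vpos": float(string.get("VPOS")),
                    "width": float(string.get("WIDTH")),
                    "height": float(string.get("HEIGHT")),
                })
    return words


words = extract_words(SAMPLE_ALTO)
for word in words:
    print(word["block"], word["line"], word["text"], word["hpos"], word["vpos"])
```

Because each word keeps its coordinates, a delivery platform can highlight search hits directly on the page image.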

Data Conversion Options
We offer three data conversion options to suit different material types, structures, and access goals — from page-based publications to more complex, article-driven content.
Automated Page-Level METS/ALTO
This fully automated option produces page-level METS/ALTO, making content full-text searchable and standards-compliant. Being automated, it is the most cost-effective choice for large collections where maintaining the article-level structure is not required.
Estimated cost: $0.15 USD per page
Best suited for:
- Page-based, text-heavy materials
- Newspapers, magazines, journals, and reports
- Collections with good-quality scans or microfilm
Page-Level METS/ALTO with Text Block Auditing
Text blocks are manually audited to reduce common OCR issues such as incorrect block ordering or mis-captured content, improving usability while maintaining a page-based structure.
Estimated cost: $0.28 USD per page
Best suited for:
- Page-based materials with more complex layouts
- Newspapers and magazines with varied columns or dense content
- Books and journals where improved reading order is important
METS/ALTO with Article Segmentation & Headline Cleanup
This option enhances the logical structure of the content, improving how users navigate and understand complex, multi-article pages. It requires human review and is therefore more resource-intensive.
Estimated cost: $0.71 USD per page
Best suited for:
- Newspapers and periodicals where content is organised into distinct articles
- Magazines or journals where section-level navigation improves usability
- Projects prioritising user experience and structured browsing
Cost of data conversion
Data conversion is typically priced per page and varies depending on the structure of the material, the level of automation required, and the conversion approach selected.
Page-level conversion options are generally more cost-effective, while enhanced structural processing — such as audited text blocks or article segmentation — involves additional human review and therefore higher per-page costs.
We’ll help you understand the trade-offs between cost, structure, and usability, and ensure the selected option aligns with your collection and access goals.
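
To make the trade-off concrete, here is a rough budget comparison built from the estimated per-page costs quoted for the three options. The 50,000-page collection size, the option keys, and the `estimate_cost` helper are hypothetical, and actual pricing will depend on the material.

```python
from decimal import Decimal

# Estimated per-page costs in USD, as quoted for each option above.
RATES = {
    "automated_page_level": Decimal("0.15"),
    "text_block_auditing": Decimal("0.28"),
    "article_segmentation": Decimal("0.71"),
}


def estimate_cost(page_count, option):
    """Return the estimated conversion cost in USD for a given option."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return RATES[option] * page_count


# Hypothetical collection size, used for illustration only.
PAGES = 50_000
for option in RATES:
    print(f"{option}: ${estimate_cost(PAGES, option):,.2f}")
# automated_page_level: $7,500.00
# text_block_auditing: $14,000.00
# article_segmentation: $35,500.00
```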
Quality assurance
Quality assurance is a core part of our data conversion service. Before files are delivered, we review the outputs to identify issues that can affect usability, search accuracy, or long-term reliability.
Our quality checks focus on validating structure, consistency, and completeness — helping ensure the converted data meets agreed specifications and performs as expected in access and discovery environments.
This final review helps safeguard your collection and reduces the risk of issues being discovered later in the delivery or preservation workflow.
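
The kind of structural check described above can be sketched in a few lines. This is not our production QA process, only a minimal illustration under stated assumptions: the `check_alto_page` helper, the two sample pages, and the ALTO v4 namespace are all hypothetical choices for this example. It flags pages with no words, words with empty content, and word boxes that fall outside the page.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v4 namespace; adjust for other ALTO versions.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical single-line page used to build test inputs.
PAGE_TEMPLATE = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#"><Layout>'
    '<Page ID="P1" PHYSICAL_IMG_NR="1" WIDTH="2400" HEIGHT="3600"><PrintSpace>'
    '<TextBlock ID="TB1"><TextLine ID="TL1">{strings}</TextLine></TextBlock>'
    '</PrintSpace></Page></Layout></alto>'
)
GOOD_PAGE = PAGE_TEMPLATE.format(
    strings='<String CONTENT="Harbour" HPOS="120" VPOS="200" WIDTH="310" HEIGHT="60"/>'
)
# This word starts near the right margin and extends past the page width.
BAD_PAGE = PAGE_TEMPLATE.format(
    strings='<String CONTENT="News" HPOS="2300" VPOS="200" WIDTH="310" HEIGHT="60"/>'
)


def check_alto_page(alto_xml):
    """Return a list of human-readable problems found in one ALTO page."""
    root = ET.fromstring(alto_xml)
    page = root.find(".//alto:Page", ALTO_NS)
    if page is None:
        return ["no Page element found"]

    problems = []
    page_w = float(page.get("WIDTH", 0))
    page_h = float(page.get("HEIGHT", 0))
    strings = page.findall(".//alto:String", ALTO_NS)
    if not strings:
        problems.append("page contains no recognised words")

    for s in strings:
        text = s.get("CONTENT")
        if not text:
            problems.append("String element with empty CONTENT")
            continue
        try:
            x, y = float(s.get("HPOS")), float(s.get("VPOS"))
            w, h = float(s.get("WIDTH")), float(s.get("HEIGHT"))
        except (TypeError, ValueError):
            # float(None) raises TypeError; non-numeric text raises ValueError.
            problems.append(f"{text!r}: missing or non-numeric coordinates")
            continue
        if x < 0 or y < 0 or x + w > page_w or y + h > page_h:
            problems.append(f"{text!r}: bounding box falls outside the page")
    return problems


print(check_alto_page(GOOD_PAGE))
print(check_alto_page(BAD_PAGE))
```

Checks like this catch mechanical faults only; judging reading order or article boundaries still needs human review.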