While most Veridian collections focus on historical material, organisations that hold content in born-digital PDF format can ingest it directly into a Veridian collection, without first converting it to METS/ALTO. This can significantly reduce costs while still providing a sufficient level of usability for most use cases.
What are born-digital PDFs?
A born-digital PDF is a file that was originally created in digital form and saved directly as a PDF—unlike a scanned PDF, which is generated by scanning source content and applying OCR (optical character recognition) to extract the text.
A key difference between born-digital and scanned PDFs lies in the reading order of the text. In scanned PDFs, OCR software can struggle to interpret complex layouts (multi-column formats, headlines, sidebars, and image placements), resulting in a jumbled or illogical reading order. Born-digital PDFs, by contrast, have the correct reading order embedded in their structure from the outset.
This is a major advantage, as one of the most time-consuming—and therefore costly—steps in preparing METS/ALTO is ensuring that text blocks appear in the correct reading order. With born-digital PDFs, this step is already taken care of.
When did born-digital PDFs become common in publishing?
Although Adobe introduced the PDF format in 1993, born-digital PDFs only began gaining traction in the late 1990s with the rise of desktop publishing tools, and became common in mainstream publishing in the early to mid-2000s.
If a collection includes content published from around 2005 onwards (particularly newspapers, internal archives, magazines, community newsletters, or smaller regional titles), there is a good chance that born-digital PDFs are available.
What does the ingestion process look like?
Veridian provides clients with file naming conventions and directory structure guidelines for preparing batches of born-digital PDFs. Once the content is organised according to these specifications, it can be sent to Veridian for ingestion.
For example, a standard directory structure for daily newspaper issues may look like this:
<Batch Name>
└── <Publication Code>
    └── <Year>
        └── <Month>
            └── <Day>_<Edition>
                └── <issue-level PDF>
An actual example for The Daily News, covering multiple editions in June 2012, might be:
Batch1
└── TDN
    └── 2012
        └── 06
            ├── 22_01
            │   └── TDN-2012-06-22_01.pdf
            ├── 23_01
            │   └── TDN-2012-06-23_01.pdf
            ├── 23_02
            │   └── TDN-2012-06-23_02.pdf
            └── 24_01
                └── TDN-2012-06-24_01.pdf
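When preparing a batch, it can help to generate and check these paths programmatically. The following Python sketch is only an illustration: the pattern is inferred from the example above (two-digit month, day, and edition; filename of the form code-year-month-day_edition.pdf) and is not Veridian's official specification. The names issue_path and is_valid_issue_path are hypothetical helpers introduced here.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Assumed convention, inferred from the example tree above;
# confirm against the guidelines Veridian actually supplies.
ISSUE_PATH = re.compile(
    r"^(?P<batch>[^/]+)/(?P<code>[A-Za-z0-9]+)/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/"
    r"(?P<day>\d{2})_(?P<edition>\d{2})/"
    # The filename must repeat the code, date, and edition from the folders.
    r"(?P=code)-(?P=year)-(?P=month)-(?P=day)_(?P=edition)\.pdf$"
)


def issue_path(batch, code, issue_date, edition=1):
    """Build the relative path for one issue-level PDF (hypothetical helper)."""
    day_edition = f"{issue_date:%d}_{edition:02d}"
    filename = f"{code}-{issue_date:%Y-%m-%d}_{edition:02d}.pdf"
    return PurePosixPath(
        batch, code, f"{issue_date:%Y}", f"{issue_date:%m}", day_edition, filename
    )


def is_valid_issue_path(path):
    """Return True if the path matches the assumed layout and names a real date."""
    match = ISSUE_PATH.match(str(path))
    if match is None:
        return False
    try:
        # Reject impossible calendar dates such as 30 February.
        date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    path = issue_path("Batch1", "TDN", date(2012, 6, 23), edition=2)
    print(path, is_valid_issue_path(path))
```

Running a check like this over a batch before sending it may catch mismatches between a folder name and the filename inside it, for example a PDF dated the 23rd sitting in the 22_01 folder.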
Our general recommendation for born-digital PDFs
If born-digital PDFs are already available and producing METS/ALTO is not a strict requirement, ingesting those PDFs directly into Veridian can be a practical and cost-effective alternative.
What about scanned PDFs?
For content produced prior to the digital publishing age, we strongly recommend continuing with the industry-standard approach: converting source materials into METS/ALTO. This ensures the highest possible quality and functionality for digital collections.
While scanning to PDF may seem like a lower-cost option upfront, the actual difference in cost between producing scanned PDFs and creating METS/ALTO is often minimal—while the difference in quality can be substantial.
That said, some organisations may hold large numbers of scanned PDFs from previous projects. In these special cases—where a significant investment has already been made—direct ingestion into Veridian may be worth considering. Technically, it is possible. However, collection custodians should be aware that the final product will be of considerably lower quality. Reading order is likely to be incorrect, and other layout and text recognition issues may further impact usability.
In short, we recommend the use of page- or article-level METS/ALTO over scanned PDFs in nearly all cases.
Please contact us if you have any questions or would like to discuss this topic further. We're here to help.