While many of our collections are historical, with newspapers dating back centuries, some include more recent issues that were produced digitally and saved as PDFs. If you have contemporary newspapers in born-digital PDF format, we may be able to ingest them into your Veridian collection directly, without first creating METS/ALTO data as our usual process requires.
What are born-digital PDFs vs scanned PDFs?
A born-digital PDF newspaper (or other content) was originally produced in digital form and saved as a PDF. A scanned / OCR'd PDF, by contrast, started as a printed newspaper and was converted into a PDF by scanning and OCR.
A key difference between born-digital and OCR'd PDFs is that the reading order of the text is often jumbled in OCR'd PDFs, whereas in born-digital PDFs it is normally correct. One of the key steps in creating page-level METS/ALTO is making sure the text blocks that make up each page are in the correct reading order, so with born-digital PDFs this step is normally already done. Note that from this point onward, we will refer to PDFs created from a scanning / OCR process as ‘scanned PDFs.’
Born-digital PDFs are typically only available for contemporary newspapers, produced after the PDF format became popular. For example, if the latest newspaper in a collection is from the 1980s (before the PDF format existed), this option probably won't apply to existing content in that collection. However, it may apply to newly obtained content, if that includes newspapers produced within the last 20 years or so.
Why skip the METS/ALTO process and ingest born-digital PDFs into Veridian?
The main benefit of skipping the METS/ALTO process is cost. Loading born-digital PDFs (if you have them) directly into Veridian will cost a lot less than producing page-level METS/ALTO, while achieving a comparable result.
Veridian can ingest born-digital PDFs and achieve essentially the same result at a presentation level as if the source data were page-level METS/ALTO.
What does the process of preparing a batch of born-digital PDFs to be ingested in a Veridian collection look like?
We will give you a set of naming rules for the PDF files and a specification for a uniform directory structure to put them in. You can then send us batches of PDFs organized that way.
As an example, a suitable uniform directory structure for day-level (as opposed to month-level or year-level) newspaper issues would look something like this:
└── <Publication Code>
    └── <issue-level PDF>
For example, for issues of The Daily News in June 2012, including two separate editions published on the 23rd of June:
│ └── TDN-2012-06-22_01.pdf
│ └── TDN-2012-06-23_01.pdf
│ └── TDN-2012-06-23_02.pdf
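Before sending a batch, it can help to check the file names automatically. The following is a minimal sketch in Python, not part of Veridian. It assumes a pattern inferred only from the example names above (publication code, date as YYYY-MM-DD, then a two-digit edition number), and the names ISSUE_NAME, Issue, parse_issue_filename and find_invalid_names are hypothetical. The naming rules we supply for your collection take precedence over this pattern.

```python
import re
from datetime import date
from typing import List, NamedTuple, Optional

# Assumed pattern, inferred from the example file names only:
# <CODE>-<YYYY>-<MM>-<DD>_<edition>.pdf, e.g. TDN-2012-06-23_02.pdf
ISSUE_NAME = re.compile(
    r"(?P<code>[A-Z]+)-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"_(?P<edition>\d{2})\.pdf"
)


class Issue(NamedTuple):
    publication_code: str
    issue_date: date
    edition: int


def parse_issue_filename(name: str) -> Optional[Issue]:
    """Return the parsed issue details, or None if the name does not fit."""
    match = ISSUE_NAME.fullmatch(name)
    if match is None:
        return None
    try:
        # date() rejects impossible dates such as 2012-06-31.
        issue_date = date(
            int(match["year"]), int(match["month"]), int(match["day"])
        )
    except ValueError:
        return None
    return Issue(match["code"], issue_date, int(match["edition"]))


def find_invalid_names(names: List[str]) -> List[str]:
    """Return the file names that do not follow the assumed pattern."""
    return [name for name in names if parse_issue_filename(name) is None]
```

Calling find_invalid_names on the list of file names in a batch returns the names that need attention. A name such as tdn-2012-6-22.pdf would be reported, because the code is lower case, the month has one digit and the edition number is missing.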
What is our general recommendation for born-digital PDFs?
For historical newspapers that were published before the born-digital era, we strongly recommend that you continue with the same process of converting source data to page-level or article-level METS/ALTO. It’s the industry standard, and will mean the best outcome for your collection.
However, if you do have contemporary newspapers in the form of born-digital PDFs, and page-level METS/ALTO is not a requirement for that data, you may be interested in this option to ingest the born-digital PDFs directly.
What about the alternate possibility of directly ingesting lower-quality scanned PDFs?
For new scanning projects, page-level or article-level METS/ALTO should always be produced instead of scanned PDFs. There are two reasons: the cost of producing page-level METS/ALTO shouldn’t be much higher than that of producing scanned PDFs alone, and METS/ALTO is the industry-standard data format for digitized newspapers.
To put this another way, scanning to PDF and loading those PDFs into Veridian is not a desirable low-cost option for new digitization projects, because the end product loses all the advantages of METS/ALTO for little difference in price.
However, for projects that already have large numbers of low-quality scanned PDFs, it will sometimes make financial sense to make the best of those existing files and ingest them directly into Veridian, given the investment that already went into creating them.
For projects in this special situation, there is nothing technically stopping direct ingestion of scanned PDFs, but the quality of the end product will be significantly lower than with page-level METS/ALTO. Specifically, the text reading order probably won't be correct, and there will likely be a number of other problems with the data, which add up to a significantly lower-quality outcome for the end user.
For these reasons (and because METS/ALTO is the industry standard), for the vast majority of projects we strongly recommend using at least page-level METS/ALTO instead of ingesting low-quality scanned PDFs.