What is crowdsourced User Text Correction (UTC)?
The Veridian User Text Correction (UTC) module was developed to improve the “searchability” of our digital archives. The feature allows readers to correct errors in the text of digitized documents within an online collection.
UTC improves the accuracy of the text, which enables better search results and a richer experience for all users. Anyone can participate in UTC for a digital collection, as long as they have created an account and logged in.
Why is UTC needed?
When a newspaper or other text document is prepared for display online, Optical Character Recognition (OCR) software is used to generate searchable text.
On some occasions, the text can be difficult for the software to register accurately. For example, a torn or water-damaged document may have some illegible or missing sections. Likewise, small fonts, imperfect or out-of-focus scans, fading, and many other factors can result in OCR text that is not 100% accurate. UTC allows users to correct these OCR errors as they come across them in the text.
What are the benefits of UTC?
As OCR text corrections are saved to the database, they gradually improve the quality of the collection and increase search result accuracy. The UTC feature also helps libraries and other organisations to build engaged online communities around their collections.
Since introducing this feature, Veridian digital collections have enjoyed more repeat visitors, a longer average duration of visits and increased total visits. Read more about the benefits of UTC here.
What about text vandalism?
We’re often asked what prevents users from maliciously vandalizing or corrupting the text in these collections. We’ve found occurrences of vandalism to be extremely rare, but have put measures in place to mitigate the risk where possible. For example, users must register before making any corrections.
Veridian doesn’t modify the original source data produced with the OCR process, so it’s always possible to roll back to the original text if it’s necessary to do so. You can read more of our thoughts on text vandalism here.
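The idea of keeping the original OCR output untouched can be illustrated with a minimal sketch. This is not Veridian's actual implementation; the `TextLine` class and its method names are hypothetical and only show how corrections can be stored alongside the original so that a rollback is always possible.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextLine:
    """One line of text (hypothetical model); the OCR output is never overwritten."""
    ocr_text: str                                          # original OCR output, kept as-is
    corrections: List[str] = field(default_factory=list)  # user corrections, oldest first

    def correct(self, new_text: str) -> None:
        # Record the correction next to the original instead of replacing it.
        self.corrections.append(new_text)

    def current_text(self) -> str:
        # The latest correction wins; fall back to the OCR text if there is none.
        return self.corrections[-1] if self.corrections else self.ocr_text

    def roll_back(self) -> None:
        # Discard all corrections, which restores the original OCR text.
        self.corrections.clear()


line = TextLine(ocr_text="Tbe qnick brown fox")
line.correct("The quick brown fox")
line.correct("vandalised text")
line.roll_back()
print(line.current_text())  # prints "Tbe qnick brown fox"
```

Because the original text is a separate field that no method writes to, even a malicious edit cannot destroy it.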
Getting started - instructions for how to correct OCR text
Create an account
You must register and create a user account for the collection you wish to contribute to. A verification email will be sent to your email address. Once verified, log in to the collection to get started correcting OCR text.
Access the text correction interface
The text correction interface is split into two parts: the right side shows the page images that make up the document, and the left side is used for editing the lines of text.
When you move your mouse over the page images in the right pane, the blocks making up the pages will highlight. You can scroll this view by dragging with the mouse, or zoom in/out using the buttons above the viewer. Clicking a highlighted block will select it and load a form for editing that block into the left pane.
How to make text corrections
There are two ways you can begin to correct text from the document viewer:
- Select the article or page you want to correct. This will display the text in the left pane of the document viewer. Click on the "Correct this text" link that appears above this text.
- Right-click on the article or page image and select "Correct article text" or "Correct page text" from the options pop-up window.
Correct the text line by line. A red box is displayed in the right pane to help you determine what text should be included in the line.
Hint: Many web browsers include spell checking functionality and this can assist with your text correction by identifying misspelled words. If your web browser does not have this functionality, it's likely there is a spell checking add-on available (see your web browser's help for information on how to install add-ons).
Save your corrections
Once you have finished correcting text, click "Save". The changes you make will take effect immediately.
You can then make further corrections to the same block, move on to the next block by clicking the "Save & next" or "Next" button, select another block in the right pane, or exit the text correction view by clicking the "Return to viewing mode" link.
Clicking "Save & exit" instead of "Save" will save the changes and then return you to the normal viewing mode automatically.
Below is a list of correction conventions that text correctors can apply. It was developed in collaboration with the Veridian user group and is intended as an initial reference for those wanting to get involved with text correction. Please refer to the ‘Help’ section of individual collections to learn more about conventions particular to their material.
A rule of thumb: type what you see
In general, it is recommended that you transcribe what you see, following the order and layout of the original document as best you can. Keep in mind that transcribing the text will improve our ability to read the documents, search for them, and use the information they contain.
Saving your work
Save your corrections regularly. The system will prompt you to save your work every five minutes. The changes you make will take effect immediately.
'This block is completely correct' checkbox
Once you have completed corrections for a block of text, please check the ‘This block is completely correct’ box.
This information feeds into the text correction statistics for the collection you are working on. It also helps other users see where someone else has already finished a section. The available text correction statistics are explained further below.
Misspellings on the original printed page
Your transcription should preserve the spelling, grammar and word order of the original document. If you come across a spelling error, type the word as printed and follow it with the correct spelling in square brackets to improve searchability, for example "recieve [receive]".
Punctuation and capitalisation should reflect what is published.
If a word is hyphenated because it is split across two lines, type it as it appears in the image, e.g. "hyphen-" at the end of the first line, and "ated" at the start of the second line. Veridian will automatically join the two parts of the word together for search purposes. Hyphens that appear elsewhere in the text should also match the image.
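The joining of split words for search purposes can be sketched roughly as follows. This is an illustration only, not Veridian's actual algorithm; the function name `join_hyphenated_lines` is hypothetical, and the sketch assumes that every hyphen at the end of a line marks a word split across lines.

```python
def join_hyphenated_lines(lines):
    """Build a search string from corrected lines, rejoining words split by a line-end hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if text.endswith("-"):
            # Previous line ended mid-word: drop the hyphen and append directly.
            text = text[:-1] + line
        elif text:
            text = text + " " + line
        else:
            text = line
    return text


lines = ["The word is hyphen-", "ated across two lines."]
print(join_hyphenated_lines(lines))  # prints "The word is hyphenated across two lines."
```

A real system would need extra care with compound words that keep their hyphen when they happen to break at the end of a line; this sketch does not attempt that.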
Blank spaces and miscellaneous punctuation and symbols
It is not necessary to correct these issues, as they do not affect searching. Some users like to clean up these types of OCR mistakes for appearance's sake.
Comments and Tags
Please add any of your own comments using the ‘comment’ feature rather than within the transcription area.
Tags can be browsed and used to narrow down searches into subject areas.
To add a public comment or tag, select a page/article in the document viewer by clicking on it. Scroll to the bottom of the left-hand pane, which displays the text, until you reach the "Add Comment" and "Add Tag" sections. Add your comment/tag into the corresponding text box and click "Add comment/tag".
To view your recent tags and comments, visit the "My account" page and choose the "Recently added tags/comments" section from the "Recent activity" tab.
Occasionally it may not be possible for the OCR software to read a line or several words, for example where the original document is faded or damaged. If you are unable to make out the original word either, use square brackets to mark the text as [illegible]. Another user might come along and be able to read it.
A block should still be marked as "completely correct" even if it contains some text marked as [illegible]. This indicates that one user believes the block is as correct as it is possible to make it. One intention of the checkbox is to eventually make it easier to locate blocks that have not been checked. And of course, if we mark illegible text with [illegible] it will be possible to search for blocks containing illegible text, if it's ever necessary or useful to locate them.
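As a rough illustration of why a consistent [illegible] marker is useful, the sketch below finds the blocks that contain it. The data layout (a dictionary of block ids to text) and the function name are assumptions made for this example and do not reflect how Veridian stores or searches text.

```python
ILLEGIBLE = "[illegible]"


def blocks_with_illegible_text(blocks):
    """Return the ids of blocks whose text contains the [illegible] marker."""
    # Compare in lower case so that "[Illegible]" is found as well.
    return [block_id for block_id, text in blocks.items() if ILLEGIBLE in text.lower()]


blocks = {
    "block-1": "Council met on Tuesday to discuss the [illegible] road.",
    "block-2": "Fine weather is expected for the weekend.",
}
print(blocks_with_illegible_text(blocks))  # prints ['block-1']
```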
Images and Illustrations
Images and illustrations are commonly picked up by the OCR software as blocks with no text. As a general guide to handle these image blocks, correctors can use [image] to identify this block as an image. In most cases, there will be a caption block below the image to describe the image content. In case there isn't one, the suggestion is to add the description within [image]. See examples below:
- [image: Photo - Senator James P. Davensofa]
- [image: Map - Northern Eriesoil]
- [image: Drawing - residential (1st flight) plan(e)]
The block can then be marked as "completely correct".
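A consistent [image] marker also makes these blocks easy to recognise later. The following is a small sketch, assuming the marker format shown in the examples above; the regular expression and the `image_description` helper are hypothetical and not part of Veridian.

```python
import re

# Matches "[image]" or "[image: some description]".
IMAGE_MARKER = re.compile(r"\[image(?::\s*(?P<description>[^\]]*))?\]")


def image_description(block_text):
    """Return the description from an [image] marker, "" for a bare [image], or None if absent."""
    match = IMAGE_MARKER.search(block_text)
    if match is None:
        return None
    return (match.group("description") or "").strip()


print(image_description("[image: Map - Northern Eriesoil]"))  # prints "Map - Northern Eriesoil"
```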
If you find corrections that are not related to the original text, you may change them back to the original text. If the corrections appear to be intentional vandalism, please report them to the collection owner, or email firstname.lastname@example.org
Missing and cut off lines
Occasionally, a line that needs correcting will be skipped. If you come across this problem, you can still make corrections. Simply add the missing line of text to the end of the line above. If there is no preceding line, add the text to the start of the following line. Where possible, make sure that the start of each line matches the start of the original line of text.
What if a "line" of text crosses over two or more columns?
As with missing and cut off lines, do your best to transcribe the text in the correct reading order.
Text correction statistics
A range of text correction statistics is available to track the progress of user community contributions. These statistics can also help guide where corrections are needed next within the collection and increase user engagement.
Track UTC progress
From the collection homepage or within the ‘My Account’ page, click on ‘more information’ to see a range of UTC statistics such as the number of users who have corrected text, total number of lines corrected and information about your own contributions.
'Completely Correct' statistics
Here you can view the quantity of completely correct blocks, articles, pages and issues within a collection.
Text Correctors Hall of Fame
This section rewards the most active UTC volunteers with a top placing. The top 100 text correctors are listed by name/pseudonym alongside the number of lines they have corrected.
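A ranking like this can be produced by summing corrected lines per user and keeping the highest totals. The sketch below is only an illustration; the input format (pairs of user name and line count) and the `hall_of_fame` function are assumptions, not Veridian's implementation.

```python
from collections import Counter


def hall_of_fame(corrected_lines, top_n=100):
    """Return (user, lines corrected) pairs for the most active correctors."""
    totals = Counter()
    for user, line_count in corrected_lines:
        totals[user] += line_count
    # most_common sorts by total, highest first, and keeps the top entries.
    return totals.most_common(top_n)


log = [("alice", 120), ("bob", 45), ("alice", 30)]
print(hall_of_fame(log, top_n=2))  # prints [('alice', 150), ('bob', 45)]
```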
Recommended items to be corrected
A link to the ‘Recommended items to be corrected’ page can be found on the collection homepage or within the ‘My Account’ page under ‘Contributions’.
The list of recommendations is arranged into tabs displaying the newspaper articles, pages and issues that are closest to being completely corrected. Each tab shows how “complete” the text correction process is as a percentage. This gives an indication of where text correctors might best concentrate their efforts to achieve a section free of OCR errors.
Note: The text correction recommendations page is new functionality and may not be available on all collection sites yet.
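One plausible way to build such a recommendation list is to compute, for each item, the percentage of its blocks that are marked completely correct and then list unfinished items from highest to lowest. This is an assumption about the approach, not a description of Veridian's actual calculation; the `recommend_items` function and its input format are hypothetical.

```python
def recommend_items(items, limit=10):
    """Rank partly corrected items by how close they are to being completely correct.

    `items` maps an item id to (completely_correct_blocks, total_blocks).
    """
    ranked = []
    for item_id, (correct, total) in items.items():
        if total == 0 or correct == total:
            continue  # skip empty items and items that are already finished
        ranked.append((item_id, round(100 * correct / total)))
    # Highest percentage first, so nearly finished items come out on top.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]


articles = {"article-a": (9, 10), "article-b": (2, 8), "article-c": (5, 5)}
print(recommend_items(articles))  # prints [('article-a', 90), ('article-b', 25)]
```

Items that are already fully corrected are left out, since the list is meant to point correctors at work that remains.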