New Search Feature: Optical Character Recognition (OCR)
This post is from the September 5th issue of the National Archives Catalog Newsletter.
With more than 92 million pages of digitized records available to search in the National Archives Catalog, we are always working on ways to improve search results to better help you find what you’re looking for.
That’s why we’re excited to share a new feature in the Catalog: Optical Character Recognition, or OCR.
What is OCR?
OCR converts images that contain typed, handwritten, or printed text into text that can be read and searched by a computer.
Previously, records in the Catalog were only searchable based on the titles, descriptions, and other fields entered by archivists, or by tags and transcriptions entered by citizen archivists. Now, with OCR capability, text from some images in the Catalog can be extracted, making that text searchable and more likely to come up in your search results.
Currently, the Catalog’s new OCR engine is applied to records in either JPG or PDF format added to the Catalog since June 2019. NARA is exploring how to retroactively process records from before that point, but right now this feature applies to millions of pages!
Here’s what you can expect and how it works:
A search for “Melvin H. Coulston” returns this Bureau of Indian Affairs record with OCR data. The search term is bolded.
SEARCH TIP – If you are searching for a name or phrase, surround it in quotation marks to do an exact phrase search.
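To illustrate why quotation marks matter, here is a minimal Python sketch of the difference between a keyword search and an exact phrase search. It is not the Catalog's search code. The function names and sample text are made up for illustration, and the behavior shown for an unquoted search (every word must appear somewhere) is an assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_match(query, text):
    """Unquoted search (assumed behavior): every query word appears somewhere."""
    words = set(tokenize(text))
    return all(term in words for term in tokenize(query))

def phrase_match(query, text):
    """Quoted search: the query words appear next to each other, in order."""
    terms = tokenize(query)
    words = tokenize(text)
    n = len(terms)
    return n > 0 and any(words[i:i + n] == terms for i in range(len(words) - n + 1))

# Invented sample text: all three words occur, but not as one phrase.
sample = "Letter from H. Melvin to the agent, forwarded by Coulston."
print(keyword_match("Melvin H. Coulston", sample))  # True
print(phrase_match("Melvin H. Coulston", sample))   # False
```

In this sketch the keyword search matches because each word occurs somewhere in the text, while the phrase search does not because the words never appear together in that order.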
To explore the results further, click on the blue description title. On the description page that follows, you can see the pages where your search term is found. They are listed below the description title and to the left of the image viewer:
Likewise, page thumbnails that contain your search term may be highlighted beneath the image viewer. Clicking on any page in the list or on a highlighted thumbnail will take you to that page.
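The page list described above can be pictured with a short Python sketch. It assumes the OCR text is stored one string per page. The function name `pages_with_term` and the sample pages are hypothetical, and the Catalog's actual matching logic may differ.

```python
def pages_with_term(page_texts, term):
    """Return 1-based page numbers whose OCR text contains the term."""
    # Collapse whitespace and ignore case so line breaks in the OCR
    # output do not hide a match.
    needle = " ".join(term.lower().split())
    hits = []
    for number, text in enumerate(page_texts, start=1):
        normalized = " ".join(text.lower().split())
        if needle and needle in normalized:
            hits.append(number)
    return hits

# Invented OCR output for a four-page record.
pages = [
    "Department of the Interior\nBureau of Indian Affairs",
    "Statement of Melvin H.\nCoulston regarding the allotment",
    "Index of correspondence",
    "Signed, MELVIN H. COULSTON",
]
print(pages_with_term(pages, "Melvin H. Coulston"))  # [2, 4]
```

This is a plain substring match, so it would also find the term inside a longer word. A real search index would likely tokenize the text first.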
Try it out!
You can test the capability yourself by running one of the following searches and clicking the first result returned for each:
We still have work to do! Right now, we are investigating options for running OCR on items that were added to the Catalog before June 2019. Additionally, records that are available only in PDF format do not currently support page jumping or highlighting.
OCR is not perfect! While this technology helps make records more searchable, we still find human-entered transcriptions to be more accurate than OCR. That means we still need your help as citizen archivists to transcribe records in the Catalog and decipher that tricky handwriting!
In case you are interested…
Technical specs: NARA’s new OCR engine is powered by the open source Tesseract software. As records are added to NARA’s Amazon Web Services (AWS) S3 cloud storage, they are run through image processing powered by a series of AWS Lambda functions.
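For readers curious what such a pipeline might look like, here is a rough Python sketch of a Lambda-style function that reacts to new S3 objects. It is not NARA's code. The handler name, the `run_ocr` parameter, and the bucket and file names are all hypothetical, and the OCR step itself is left as a placeholder rather than a real Tesseract call.

```python
from urllib.parse import unquote_plus

# The post mentions JPG and PDF; treating ".jpeg" as JPG is an assumption.
OCR_EXTENSIONS = (".jpg", ".jpeg", ".pdf")

def objects_to_process(event):
    """Yield (bucket, key) for each JPG or PDF named in an S3 event notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(OCR_EXTENSIONS):
            yield bucket, key

def handler(event, context, run_ocr=None):
    """Lambda-style entry point; run_ocr is a hypothetical stand-in for the OCR step."""
    if run_ocr is None:
        run_ocr = lambda bucket, key: ""  # placeholder: no real OCR happens here
    return {key: run_ocr(bucket, key) for bucket, key in objects_to_process(event)}

# Invented event with one image and one file that should be skipped.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "scans/page+001.jpg"}}},
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "scans/notes.txt"}}},
    ]
}
print(handler(sample_event, None, run_ocr=lambda b, k: f"text of {k}"))
# {'scans/page 001.jpg': 'text of scans/page 001.jpg'}
```

Passing the OCR step in as a function keeps the sketch runnable without AWS or Tesseract installed. In a real deployment that function would download the object and hand it to the OCR engine.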