A yearbook is a type of book published annually to record, highlight, and commemorate the past year of a school.
Our team at MyHeritage took on a complex project: extracting individual pictures, names, and ages from hundreds of thousands of yearbooks, structuring the data, and creating a searchable index that covers the majority of US schools from 1890 to 1979, more than 290 million individuals in total. In this article I’ll describe the problems we encountered during this project and how we solved them.
First of all, let me explain why we needed to tackle this challenge. MyHeritage is a genealogy platform that provides access to almost 10 billion historical records. Using the SuperSearch™ function, users can search these records to discover new information about their families and photographs featuring their relatives. Furthermore, MyHeritage’s Record Matching technology automatically notifies users when historical records match information on their family trees, saving them the need to actively search the archives. Each record collection we add can lead to countless new discoveries for our users.
Our objective was to separate the tasks that could be done algorithmically from those that had to be done manually. Any feasible algorithmic solution would be much cheaper than a time-consuming and costly manual data entry project. The ultimate goal of this project was to have every name in those books matched to its corresponding image. Ideally, we also wanted to know which grade each student was in during a specific year, which would allow us to infer his or her age. We knew that achieving these goals with 100% accuracy would take a lifetime. Therefore, we aimed for the best results possible within a reasonable budget and time frame.
Our source data consisted of 36 million page images and the same number of XML files produced by the OCR process.
OCR (Optical Character Recognition) is the electronic conversion of scanned documents to machine-encoded text. To produce OCR data from books, the books must first be digitized using special book scanners. The resulting images are then processed by an OCR engine.
To make the structured index, we had to solve four main problems:
- Recognize portraits on the page.
- Extract people’s names from OCR results.
- Correlate extracted names with portraits.
- Predict people’s ages.
1. Portrait recognition
Our first step was to try existing face detection algorithms, but they did not work well with old scanned black-and-white pictures. Haar cascades worked perfectly in some cases but failed completely in others.
The other option was to detect dark rectangular shapes in the image, since portraits are usually dark rectangles on a bright background. This is a classic computer vision problem.
Canny edge detection is a popular edge detection algorithm. Developed by John F. Canny in 1986, it can be broken down into four stages: noise reduction, finding the intensity gradient of the image, applying non-maximum suppression, and applying hysteresis thresholding. In the resulting edge map, we looked for potential rectangles and rejected false positives by analyzing their positions and areas.
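The false-positive rejection step can be illustrated with a minimal sketch. It assumes the candidate rectangles have already been extracted from the edge map as `(x, y, width, height)` tuples. The function name, the aspect-ratio bounds, and the area tolerance are hypothetical values chosen for illustration, not the ones used in the project.

```python
from statistics import median


def filter_portrait_candidates(rects, min_aspect=0.6, max_aspect=1.0,
                               area_tolerance=0.5):
    """Keep rectangles whose shape and area resemble a typical portrait.

    rects: list of (x, y, width, height) tuples (assumed input format).
    """
    # Step 1: keep only portrait-shaped boxes (width/height within bounds).
    shaped = [r for r in rects
              if r[3] > 0 and min_aspect <= r[2] / r[3] <= max_aspect]
    if not shaped:
        return []

    # Step 2: portraits on one page tend to share a size, so reject boxes
    # whose area is far from the median area of the remaining candidates.
    typical_area = median(w * h for _, _, w, h in shaped)
    return [r for r in shaped
            if abs(r[2] * r[3] - typical_area) <= area_tolerance * typical_area]


candidates = [
    (10, 10, 80, 100),    # portrait
    (100, 10, 80, 100),   # portrait
    (190, 10, 82, 98),    # portrait
    (0, 0, 600, 20),      # wide banner: wrong shape
    (300, 300, 8, 10),    # speck: right shape, wrong size
]
portraits = filter_portrait_candidates(candidates)
```

Using the median rather than the mean keeps a few large or tiny outliers from shifting the reference size.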
At this point, we started building a scalable environment in AWS to process the 36 million images. The process stored each portrait’s coordinates and its relation to a page in an RDS database. At the same time, we built an interface on top of this central database to monitor and test the results. Portrait detection is CPU-intensive, so we used a cluster of powerful EC2 instances driven by SQS.
2. Name extraction
Named-entity recognition (NER) algorithms seek to locate named entity mentions in unstructured text and classify them into predefined categories, such as people’s names, organizations, and locations, with the goal of reaching near-human performance.
NER systems have been built with linguistic grammar-based techniques as well as statistical models, such as machine learning. NER has many use cases; in our case, it solved the name extraction problem. After extracting the names, we compared their positions and counts with those of the detected portraits to get rid of false positives.
3. Name and portrait correlation
Once we had pinpointed the positions of the names and portraits, we wanted to link them so we could tell which portrait belongs to which person. This non-trivial problem was made harder by the fact that neither names nor portraits could be detected perfectly. There were always false-positive names, some portraits were missed, and so on. We tried to understand how the human brain solves this problem when looking at a yearbook page; this led us to build an algorithm that predicts page structure, guesses correlations between names and portraits, and fixes errors.
The result of this algorithm was an array of logical links between names and portraits. It, too, was imperfect, and we could never be completely sure of its accuracy without inspecting a huge number of pages. To solve this, we built an app to browse the pages and fix discrepancies.
At this stage, hundreds of operators reviewed millions of pages and fixed any inconsistencies they found. This required a powerful RDS database and a scalable cluster of web instances hosting the review application. Because of the vendor’s geographic location, transferring the entire image set to the AWS Asia Pacific (Mumbai) Region made the process much faster.
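To make the linking problem in this section concrete, here is a deliberately simple baseline: greedy one-to-one matching of each name to the nearest portrait, with a distance cutoff so that stray names and unmatched portraits stay unlinked. This is not the page-structure algorithm described above, whose details are not given here. The function name, the input tuples, and the `max_distance` value are assumptions made for the sketch.

```python
from math import hypot


def link_names_to_portraits(names, portraits, max_distance=60.0):
    """Greedily pair each name with its nearest free portrait.

    names:     list of (text, x, y), the center of the name on the page.
    portraits: list of (portrait_id, x, y), e.g. the bottom-center of the
               portrait, where a caption would usually sit (an assumption).
    Returns a dict mapping portrait_id to name text.
    """
    # Every possible pair, closest first.
    pairs = sorted(
        (hypot(nx - px, ny - py), n_idx, p_idx)
        for n_idx, (_, nx, ny) in enumerate(names)
        for p_idx, (_, px, py) in enumerate(portraits)
    )

    used_names, used_portraits, links = set(), set(), {}
    for distance, n_idx, p_idx in pairs:
        if distance > max_distance:
            break  # all remaining pairs are even farther apart
        if n_idx in used_names or p_idx in used_portraits:
            continue  # keep the mapping one-to-one
        used_names.add(n_idx)
        used_portraits.add(p_idx)
        links[portraits[p_idx][0]] = names[n_idx][0]
    return links


names = [("Mary Smith", 50, 120), ("John Doe", 150, 120),
         ("Class of 1952", 300, 20)]   # the last one is a false-positive name
portraits = [("p1", 50, 110), ("p2", 150, 110), ("p3", 250, 110)]
links = link_names_to_portraits(names, portraits)
```

A baseline like this leaves the false-positive name and the caption-less portrait unlinked, which is the kind of leftover a manual review pass would then resolve.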
4. Age deduction
To build a better searchable index, we wanted to know students’ ages and extrapolate their birth years from this information. One option was to use AI algorithms that guess a person’s age from an image of his or her face. However, the algorithms we tested were completely useless for pictures in yearbooks. The other option was to infer students’ grades from the titles on the pages and estimate age by grade. For example, freshmen (9th grade) are usually 14–15 years old, juniors (11th grade) are usually 16–17 years old, and so on.
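The grade-to-age arithmetic can be written down directly. The sketch below assumes that a student in grade g is g + 5 to g + 6 years old, which matches the examples above (9th grade: 14–15, 11th grade: 16–17), and that the yearbook year is the calendar year in which the school year ended. The function name is hypothetical.

```python
def estimate_birth_year_range(grade, yearbook_year):
    """Return (earliest, latest) likely birth year for a student.

    grade:         school grade, 1 to 12.
    yearbook_year: year the school year ended (assumed to be the
                   year printed on the yearbook).
    """
    if not 1 <= grade <= 12:
        raise ValueError("grade must be between 1 and 12")
    # Assumption: typical age in grade g is g + 5 to g + 6 years.
    youngest_age, oldest_age = grade + 5, grade + 6
    return (yearbook_year - oldest_age, yearbook_year - youngest_age)


freshman_range = estimate_birth_year_range(9, 1955)   # (1940, 1941)
junior_range = estimate_birth_year_range(11, 1955)    # (1938, 1939)
```

Returning a range rather than a single year keeps the uncertainty explicit for the search index.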
We moved in this direction and built an algorithm that extracts titles from OCR’ed pages. Based on font size, phrase length, and position, the algorithm decides which text blocks represent titles.
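A title detector based on these three signals might look like the sketch below. The `TextBlock` fields, the thresholds, and the rule that a title sits in the top quarter of the page are all assumptions for illustration; the real OCR XML schema and the tuned values are not described in this article.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBlock:
    text: str
    font_size: float   # in points, as reported by the OCR engine (assumed)
    y: float           # top of the block as a fraction of page height, 0..1


def find_title_blocks(blocks, size_ratio=1.5, max_words=6, max_y=0.25):
    """Pick blocks that are large, short, and near the top of the page."""
    if not blocks:
        return []
    # The median font size approximates the body text size on the page.
    body_size = median(b.font_size for b in blocks)
    return [
        b for b in blocks
        if b.font_size >= size_ratio * body_size    # font size
        and len(b.text.split()) <= max_words        # phrase length
        and b.y <= max_y                            # position
    ]


page = [
    TextBlock("SOPHOMORES", 28, 0.05),
    TextBlock("Mary Smith", 10, 0.30),
    TextBlock("John Doe", 10, 0.30),
    TextBlock("Anne Lee", 10, 0.55),
    TextBlock("The sophomore class enjoyed a busy and memorable year", 10, 0.80),
]
titles = [b.text for b in find_title_blocks(page)]
```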
In the extracted titles, we looked for patterns such as “Sophomores,” “Seniors,” “7th Grade,” and many others to infer the book structure and predict which section of the book a certain student belongs to. In combination with book metadata, this allowed us to assign approximate birth years to students.
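As a rough illustration of this kind of pattern matching, the sketch below maps a title to a grade with two regular expressions and then carries the most recent grade forward across pages. The pattern list is far from complete, and the carry-forward rule is an assumption: a real book also contains faculty, sports, and advertising sections that should end a class section. All names are hypothetical.

```python
import re

CLASS_NAMES = {
    "freshman": 9, "freshmen": 9,
    "sophomore": 10, "sophomores": 10,
    "junior": 11, "juniors": 11,
    "senior": 12, "seniors": 12,
}
_CLASS_RE = re.compile(r"\b(" + "|".join(CLASS_NAMES) + r")\b", re.IGNORECASE)
_GRADE_RE = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)\s+grade\b", re.IGNORECASE)


def grade_from_title(title):
    """Return the grade a title refers to, or None if nothing matches."""
    match = _GRADE_RE.search(title)
    if match:
        grade = int(match.group(1))
        return grade if 1 <= grade <= 12 else None
    match = _CLASS_RE.search(title)
    # Known limitation: "Junior High School" would be read as 11th grade.
    return CLASS_NAMES[match.group(1).lower()] if match else None


def assign_grades_to_pages(page_titles):
    """page_titles: one list of title strings per page, in book order.

    Pages without a recognizable title inherit the last grade seen.
    """
    current, result = None, []
    for titles in page_titles:
        for title in titles:
            grade = grade_from_title(title)
            if grade is not None:
                current = grade
        result.append(current)
    return result


page_grades = assign_grades_to_pages([["Seniors"], [], ["Juniors"], []])
```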
To test the results, we built an interface that let the QA team manually index the grades on randomly selected pages and compare them with the algorithm’s results.
Putting it all together
The next step was to extract the actual portrait images from the page images by their coordinates and to create thumbnails of the pages. This was done by a Node.js process running on EC2 instances coordinated by SQS. This process also reduced redundancy, since repeated entries are often caused by a name appearing on two or more pages of the same book. The resulting images were stored on S3.
Once we had all assets and metadata prepared, we built searchable records and indexed them within Apache Solr, making this one of the biggest historical record collections in MyHeritage’s collection catalog.
Thanks for reading.