Source: Andornot Blog

Improving Search Filters with Named Entity Recognition (NER)

Named Entity Recognition (NER) is a valuable tool for identifying names within a text, such as the names of people and places. Our summer 2024 intern, Jarrett MacFarlane, explored uses of NER with full-text resources that have not been catalogued or described by a librarian or archivist. The intention was to extract names that could be used to filter a search, or to present a browsable index of names, in our Andornot Discovery Interface search engine. For this project, he worked with local Ontario newspapers dating from the late 19th and early 20th centuries.

Jarrett developed an NER script that uses the natural language processing (NLP) library spaCy to automatically pick out names from documents. The script can then group similar names that are likely to refer to the same thing, and check place names against existing datasets of place names for accuracy. Depending on the quality of the historical documents, names may appear in many variations, with misspellings, differing formats, or partial matches. For example, "M. A. Smith," "Mary A. Smith," and "Mary A. Srnith" (an OCR error) are all likely to refer to the same person.

Some challenges with performing NER on historical documents include:

- Extracting names from historical newspapers where optical character recognition (OCR) may have introduced spelling errors because of the poor condition of the original document.
- Ensuring the accurate representation of place names in cultural collections.

Our NER approach addresses these issues by first processing the text with spaCy's transformer model to extract the names. We can then group together names that are partial matches, or likely to represent the same thing, using Levenshtein distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into another. For example, in historical newspapers, "Wm. Hart" would be a common shortening for "William Hart." Because these two forms of the same name share many letters in the same order, the Levenshtein distance between them is small compared with the distance between unrelated names, which means we can identify them as likely to be the same name and group them together. This allows us to eliminate minor variations in the final name filters, and ideally present the most relevant form of the name for user filtering.

For place names, we can check them against external datasets, such as the Canadian Geographical Names Database created by Natural Resources Canada, which includes the Indigenous Place Names dataset. This helps ensure that culturally significant names are properly represented.

Jarrett's work with NER this summer will appear in upcoming projects with digitized newspaper collections, and is available to be applied retroactively to existing sites powered by our Andornot Discovery Interface.
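
To make the name-grouping idea concrete, here is a minimal Python sketch of grouping by Levenshtein distance. It is an illustration only, not Jarrett's script: the function names, the greedy grouping strategy, the length-normalized similarity score, and the 0.75 threshold are all assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character insertions, deletions,
    and substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1 (1.0 = identical), ignoring case.
    Normalizing by length is an assumption of this sketch."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_names(names, threshold=0.75):
    """Greedy grouping: add each name to the first group whose first
    member is similar enough, otherwise start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


if __name__ == "__main__":
    names = ["Mary A. Smith", "M. A. Smith", "Mary A. Srnith", "John Brown"]
    for group in group_names(names):
        print(group)
```

With these settings the three Smith variants fall into one group and "John Brown" into another. Note that the result depends on input order, because each name is compared only with the first member of a group. Heavier abbreviations are harder: "Wm. Hart" and "William Hart" are six edits apart, so a plain threshold like this one would likely need to be loosened or paired with a table of common abbreviations.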
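
The place-name check can be sketched in a similar way. The example below assumes the official names have already been exported from a gazetteer into a plain list; the sample list, the function names, and the 0.85 cutoff are hypothetical. For the fuzzy fallback it uses Python's standard `difflib` module, which scores similarity with a different measure than Levenshtein distance, simply to keep the sketch dependency-free.

```python
import difflib
import unicodedata

# Hypothetical sample; a real run would load names from a gazetteer export.
SAMPLE_GAZETTEER = ["Toronto", "Ottawa", "Kingston", "Peterborough"]


def normalize(name: str) -> str:
    """Comparison key: case-folded, accents stripped, whitespace collapsed."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    return " ".join(without_marks.casefold().split())


def build_index(official_names):
    """Map each comparison key back to the official spelling."""
    return {normalize(name): name for name in official_names}


def check_place(candidate: str, index: dict, cutoff: float = 0.85):
    """Return (official_name, status); status is 'exact', 'fuzzy',
    or 'unmatched' (in which case official_name is None)."""
    key = normalize(candidate)
    if key in index:
        return index[key], "exact"
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return index[close[0]], "fuzzy"
    return None, "unmatched"


if __name__ == "__main__":
    index = build_index(SAMPLE_GAZETTEER)
    for extracted in ["toronto", "Peterborongh", "Atlantis"]:
        print(extracted, "->", check_place(extracted, index))
```

Because the index maps a simplified key back to the official spelling, a match always returns the name exactly as the dataset records it, including any diacritics. Fuzzy and unmatched results would be natural candidates for human review rather than automatic correction.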


Andornot provides digitization, website design, library and media management solutions for public institutions, government organizations and museums.