1. Introduction
Optical character recognition (OCR) is a technology that converts scanned images of text into machine-readable text. OCR can be used to digitize printed or handwritten documents, such as books, invoices, receipts, forms, etc. OCR can also be applied to images and videos that contain text, such as street signs, license plates, captions, etc.
However, OCR text is not always accurate and reliable. OCR text may contain errors, such as misspellings, incorrect characters, missing spaces, etc. OCR text may also lack structure and context, such as headings, paragraphs, tables, etc. OCR text may also vary in quality, depending on the resolution, contrast, orientation, and language of the original image.
Therefore, OCR text may not be suitable for some natural language processing (NLP) applications, such as information extraction, text summarization, text classification, etc. NLP applications require clean and structured text to perform well and produce meaningful results.
One way to improve the quality and usability of OCR text for NLP applications is to perform named entity recognition (NER) on it. Named entity recognition is a subtask of information extraction that identifies and classifies named entities in a text, such as persons, organizations, locations, dates, etc. Named entity recognition can help extract useful information from OCR text and enrich it with semantic labels and metadata.
In this blog, you will learn how to perform named entity recognition on OCR text using NLP tools. You will learn:
- What is OCR and why is it important?
- What is named entity recognition and how does it work?
- How to perform named entity recognition on OCR text using NLP tools
- Examples of named entity recognition on OCR text
- Challenges and limitations of named entity recognition on OCR text
- Conclusion and future directions
By the end of this blog, you will be able to use NLP tools to perform named entity recognition on OCR text and extract useful information from scanned documents, images, and videos.
2. What is OCR and why is it important?
As described in the introduction, optical character recognition (OCR) converts scanned images of text into machine-readable text. The source can be a printed or handwritten document, such as a book, invoice, receipt, or form, or text that appears inside an image or video frame, such as a street sign, license plate, or caption.
OCR is important for several reasons. First, OCR can help preserve and access historical and cultural documents that are otherwise difficult to read or access. For example, OCR can help digitize ancient manuscripts, rare books, newspapers, etc. Second, OCR can help automate and streamline various business processes that involve document processing, such as data entry, invoice processing, document management, etc. For example, OCR can help extract data from invoices and receipts and store them in a database. Third, OCR can help improve accessibility and inclusion for people with visual impairments or literacy challenges. For example, OCR can help convert printed text into audio or braille.
However, OCR is not a perfect technology. Its output can contain character-level errors (misspellings, incorrect characters, missing spaces), it often loses layout information such as headings, paragraphs, and tables, and its quality depends on the resolution, contrast, orientation, and language of the original image. Because most NLP applications, such as information extraction, text summarization, and text classification, are built and tuned on clean text, they tend to perform worse on raw OCR output.
In the next section, you will learn what named entity recognition is and how it can help improve the quality and usability of OCR text for NLP applications.
3. What is named entity recognition and how does it work?
Named entity recognition (NER) is a subtask of information extraction that identifies and classifies named entities in a text, such as persons, organizations, locations, dates, etc. Named entities are words or phrases that refer to specific entities in the real world, such as Barack Obama, Microsoft, New York, January 1st, 2020, etc. Named entity recognition can help extract useful information from a text and enrich it with semantic labels and metadata.
Named entity recognition works by applying a combination of linguistic rules and machine learning models to a text. Linguistic rules are based on the syntax and semantics of the language, such as part-of-speech tags, capitalization, punctuation, etc. Machine learning models are trained on large corpora of annotated texts, where the named entities are marked and labeled with their types. The models learn to recognize the patterns and features that indicate the presence and type of a named entity in a text.
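To make the rule-based side concrete, here is a minimal sketch that uses only capitalization and a date pattern. The rule set, the labels, and the function name `rule_based_entities` are illustrative assumptions, not part of any particular NLP tool.

```python
import re

MONTHS = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)

# Hypothetical, deliberately tiny rule set: each rule pairs a label with a regex.
RULES = [
    # A month name, a day with an optional ordinal suffix, and a four-digit year.
    ("DATE", re.compile(MONTHS + r" \d{1,2}(?:st|nd|rd|th)?, \d{4}")),
    # Two or more consecutive capitalized words, e.g. "Barack Obama".
    ("NAME", re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")),
]


def rule_based_entities(text):
    """Return (start, end, label, surface) tuples, skipping overlapping matches."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            # Keep a match only if it does not overlap one already accepted.
            if all(m.end() <= s or m.start() >= e for s, e, _, _ in found):
                found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)


sample = "Barack Obama visited New York on January 1st, 2020."
entities = rule_based_entities(sample)
for start, end, label, surface in entities:
    print(label, surface)
```

Note that the capitalization rule cannot tell a person (Barack Obama) from a location (New York); both come back as NAME. Making that distinction from context is the part that trained models handle.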
There are different types of named entity recognition, depending on the level of granularity and specificity of the entities. For example, some common types of named entity recognition are:
- Coarse-grained named entity recognition: This type of named entity recognition identifies and classifies named entities into a few broad categories, such as person, organization, location, date, etc. For example, Barack Obama is a person, Microsoft is an organization, New York is a location, January 1st, 2020 is a date, etc.
- Fine-grained named entity recognition: This type of named entity recognition identifies and classifies named entities into more specific and detailed categories, such as politician, software company, city, date of birth, etc. For example, Barack Obama is a politician, Microsoft is a software company, and New York is a city. A date such as January 1st, 2020 can only be labeled as a date of birth if the surrounding context supports it, which is why fine-grained labels depend more heavily on context than coarse-grained ones.
- Domain-specific named entity recognition: This type of named entity recognition identifies and classifies named entities that are relevant to a specific domain or field, such as medicine, law, sports, etc. For example, aspirin is a drug name, habeas corpus is a legal term, Lionel Messi is a soccer player, etc.
In the next section, you will learn how to perform named entity recognition on OCR text using NLP tools.
4. How to perform named entity recognition on OCR text using NLP tools
In this section, you will learn how to perform named entity recognition on OCR text using NLP tools. You will follow these steps:
- Preprocessing the OCR text
- Choosing an NLP tool for named entity recognition
- Applying the NLP tool to the OCR text and extracting entities
Before you start, you will need to have some OCR text that you want to analyze. You can use any OCR tool or service to convert your scanned documents, images, or videos into text. For example, you can use OneDrive to scan and upload your documents, and then use the OCR feature in OneNote to extract the text. Alternatively, you can use Azure Computer Vision to perform OCR on images and videos.
Once you have your OCR text, you can proceed to the next step.
4.1. Preprocessing the OCR text
The first step to perform named entity recognition on OCR text is to preprocess the text. Preprocessing the text means applying some techniques to clean and normalize the text, such as correcting spelling errors, removing noise, splitting sentences, tokenizing words, etc. Preprocessing the text can help improve the accuracy and reliability of the named entity recognition process, as well as the readability and usability of the extracted information.
There are different ways to preprocess the OCR text, depending on the quality and format of the text, as well as the NLP tool that you are going to use. Some common preprocessing techniques are:
- Spelling correction: This technique involves detecting and correcting spelling errors in the OCR text, such as teh instead of the, reciept instead of receipt, etc. Spelling correction can help reduce the noise and ambiguity in the OCR text, as well as improve the matching and alignment of the named entities. Apply it with care: a general-purpose spell checker can also "correct" rare surnames or company names into dictionary words, so it is often safer to skip capitalized tokens. You can use various spelling correction tools or libraries, such as pyspellchecker, symspellpy, autocorrect, etc.
- Noise removal: This technique involves removing or replacing stray characters that the OCR engine produced from smudges, borders, or bullets, such as isolated #, *, |, or ~. Characters such as @ and ? are noise only when they appear out of place, since they also occur in email addresses and genuine questions. Noise removal can help simplify and standardize the OCR text, as well as avoid confusion and errors in the named entity recognition process. You can use various noise removal tools or libraries, such as clean-text, unidecode, regex, etc.
- Sentence splitting: This technique involves breaking the OCR text into sentences, based on sentence-final punctuation marks such as ., !, and ?. Because OCR often drops or misreads punctuation, and abbreviations such as Inc. also end in a period, a trained splitter is usually more reliable than a purely punctuation-based rule. Sentence splitting can help segment and organize the OCR text, as well as facilitate the named entity recognition process. You can use sentence splitters from libraries such as nltk, spacy, or pysbd.
- Word tokenization: This technique involves splitting the sentences into smaller units, such as words or tokens, based on whitespace (spaces, \t, \n) and punctuation. Word tokenization can help identify and isolate the individual words or tokens in the OCR text, as well as prepare the input for the named entity recognition process. You can use the tokenizers in libraries such as nltk (word_tokenize) or spacy.
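The noise removal, sentence splitting, and word tokenization steps above can be sketched with only the Python standard library. This is a simplified illustration: the function names, the set of characters kept, and the splitting rules are assumptions chosen for this example, and the libraries listed above handle many more cases.

```python
import re
import unicodedata


def remove_noise(text):
    """Normalize Unicode, drop stray symbols, and collapse runs of spaces."""
    text = unicodedata.normalize("NFKC", text)
    # Keep letters, digits, whitespace, and punctuation that often appears
    # inside entities (emails, phone numbers, abbreviations, amounts).
    text = re.sub(r"[^\w\s.,;:!?@&()'$%/+-]", " ", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by a capital or digit."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Split into word-like tokens (keeping emails whole) and punctuation."""
    return re.findall(r"\w+(?:[-'.@]\w+)*|[^\w\s]", sentence)


raw = "Jane Doe *# joined  ACME in 2019. She now lives in Paris! Contact: jane.doe@acme.com"
cleaned = remove_noise(raw)
sentences = split_sentences(cleaned)
for sentence in sentences:
    print(tokenize(sentence))
```

The splitter would wrongly break after an abbreviation such as "Dr." or "Inc." when a capitalized word follows, which is one reason to prefer a trained sentence splitter on real documents.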
After preprocessing the OCR text, you can proceed to the next step.
4.2. Choosing an NLP tool for named entity recognition
There are many NLP tools available for performing named entity recognition on text. Some of the most popular and widely used ones are:
- spaCy: spaCy is an open-source library for advanced natural language processing in Python. It offers a fast statistical named entity recognizer, with trained pipelines available for a range of languages. Its English pipelines, such as en_core_web_sm, recognize 18 entity types. Pipeline names reflect the genre of the training data (for example web or news), and models for specialized domains, such as the biomedical models from the separate scispaCy project, are also available. You can also customize and train your own named entity recognition models using spaCy.
- Stanford NER: Stanford NER is a Java implementation of a named entity recognizer based on conditional random fields (CRFs). Its pre-trained English models come in a 3-class variant (person, organization, location), a 4-class variant (adding miscellaneous), and a 7-class variant (location, person, organization, money, percent, date, time). Stanford NER also provides pre-trained models for German, Spanish, and Chinese. You can also train your own models using Stanford NER.
- NLTK: NLTK is a leading platform for building Python programs to work with human language data. It provides a named entity chunker (nltk.ne_chunk) that uses a classifier over part-of-speech-tagged tokens to group them into entities such as PERSON, ORGANIZATION, and GPE. The bundled pre-trained chunker targets English; for other languages you can train your own models, for example on the Spanish and Dutch CoNLL-2002 corpora that NLTK distributes.
Each of these NLP tools has its own advantages and disadvantages. For example, spaCy is fast and accurate, but it requires more memory and disk space. Stanford NER is robust and flexible, but it is slower and less user-friendly. NLTK is simple and easy to use, but it is less accurate and less comprehensive.
Therefore, choosing an NLP tool for named entity recognition depends on several factors, such as:
- The language and domain of your OCR text
- The entity types and categories that you want to recognize
- The accuracy and speed that you need for your application
- The resources and skills that you have for installing and using the tool
- The possibility and feasibility of customizing and training your own models
In the next section, you will learn how to apply the chosen NLP tool to the OCR text and extract entities from it.
4.3. Applying the NLP tool to the OCR text and extracting entities
After choosing an NLP tool for named entity recognition, you need to apply it to the OCR text and extract entities from it. The steps for applying the NLP tool may vary depending on the tool and the language that you are using, but the general process is as follows:
- Import the NLP tool and the OCR text into your program. For example, if you are using spaCy and Python, you can import spaCy and load the OCR text from a file:

```python
import spacy

# Read the OCR output; the file name is only an example.
with open("ocr_text.txt", encoding="utf-8") as f:
    ocr_text = f.read()
```

- Load the pre-trained model or the custom model that you have trained for named entity recognition. For example, if you are using spaCy and English, you can load the pre-trained model “en_core_web_sm” that recognizes 18 entity types (install it first with python -m spacy download en_core_web_sm):

```python
nlp = spacy.load("en_core_web_sm")
```

- Apply the model to the OCR text and get the named entity recognition results. For example, if you are using spaCy, you can create a doc object and iterate over the entities in it:

```python
doc = nlp(ocr_text)
for ent in doc.ents:
    print(ent.text, ent.label_)
```

- Save the named entity recognition results in a file or a database, or display them on a user interface. For example, if you want to save the results in a CSV file, you can use the csv module in Python:

```python
import csv

# newline="" stops the csv module from writing blank rows on Windows.
with open("ner_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Entity", "Type"])
    for ent in doc.ents:
        writer.writerow([ent.text, ent.label_])
```
By applying the NLP tool to the OCR text and extracting entities from it, you can improve the quality and usability of the OCR text for NLP applications. You can also use the extracted entities for further analysis and processing, such as information retrieval, knowledge graph construction, relation extraction, etc.
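As a small illustration of such further processing, the sketch below groups extracted entities by type and removes duplicates, which is a common first step before indexing them or building a knowledge graph. The function name `group_entities` and the sample pairs are hypothetical; in practice the pairs would come from something like `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs by label, dropping case-insensitive duplicates."""
    grouped = defaultdict(list)
    seen = set()
    for text, label in pairs:
        # Collapse internal whitespace, which OCR line breaks often introduce.
        normalized = " ".join(text.split())
        key = (label, normalized.lower())
        if normalized and key not in seen:
            seen.add(key)
            grouped[label].append(normalized)
    return dict(grouped)


# Hypothetical pairs, shaped like the (text, label) output of an NER tool.
pairs = [
    ("John Smith", "PERSON"),
    ("ABC Inc.", "ORG"),
    ("New\nYork", "GPE"),
    ("john smith", "PERSON"),
]
index = group_entities(pairs)
print(index)
```

Keeping the first spelling seen for each entity is an arbitrary choice here; on noisy OCR text you may prefer the most frequent variant instead.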
In the next section, you will see some examples of named entity recognition on OCR text and how it can help extract useful information from scanned documents, images, and videos.
5. Examples of named entity recognition on OCR text
In this section, you will see some examples of how to perform named entity recognition on OCR text using NLP tools. You will see how different NLP tools can extract different types of entities from the same OCR text, and how the results can vary depending on the quality and format of the OCR text.
For the sake of simplicity, we will use a short and simple OCR text as our input. The OCR text is taken from a scanned image of a business card, which contains the name, title, company, address, phone number, and email of a person. The OCR text is as follows:
John Smith Manager ABC Inc. 123 Main Street New York, NY 10001 (212) 555-1234 john.smith@abc.com
We will use three popular NLP tools to perform named entity recognition on this OCR text: spaCy, Stanford NER, and Google Cloud Natural Language API. We will compare the output of each tool and see what entities they can identify and how they label them.
Let’s start with spaCy, which is a free and open-source library for NLP in Python. spaCy has a built-in named entity recognizer that can identify 18 types of entities, such as PERSON, ORG, GPE, etc. To use spaCy, we need to import the library and load a pre-trained model. We will use the en_core_web_sm model, which is a small English model that can perform various NLP tasks, including named entity recognition. We will also use the displacy module to visualize the entities in the OCR text. The code is as follows:
```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "John Smith\nManager\nABC Inc.\n123 Main Street\n"
    "New York, NY 10001\n(212) 555-1234\njohn.smith@abc.com"
)
# In a Jupyter notebook this draws the highlighted text inline.
displacy.render(doc, style="ent")
```
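displaCy highlights each entity in a color that depends on its type and prints the type label next to it. On this text, spaCy typically identifies John Smith as a PERSON, ABC Inc. as an ORG, and New York as a GPE, although the exact output depends on the model version and on how the line breaks are handled. The label set of en_core_web_sm has no EMAIL type, so john.smith@abc.com is not returned as an email entity; spaCy does expose a token-level like_email attribute that can flag it. Likewise, spaCy does not recognize the title, the street address, or the phone number as meaningful entities, because its label set has no types for them.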
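Contact details such as phone numbers and email addresses follow regular patterns, so a rule-based pass can complement a statistical model on text like this business card. The sketch below is one possible approach using only the standard library; the patterns and the labels EMAIL, PHONE, and ZIP are assumptions tuned to the US-style formats on this card, not a general solution.

```python
import re

CARD_TEXT = (
    "John Smith\nManager\nABC Inc.\n123 Main Street\n"
    "New York, NY 10001\n(212) 555-1234\njohn.smith@abc.com"
)

# Hypothetical patterns that cover only the formats seen on this card.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"),
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def pattern_entities(text):
    """Return (surface, label) pairs for every pattern match, in text order."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((m.start(), m.group(), label) for m in pattern.finditer(text))
    return [(surface, label) for _, surface, label in sorted(hits)]


contacts = pattern_entities(CARD_TEXT)
for surface, label in contacts:
    print(label, surface)
```

The results can be merged with the entities returned by the NER model, so that the model supplies names and organizations while the patterns supply the contact fields.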
6. Challenges and limitations of named entity recognition on OCR text
Named entity recognition on OCR text is not an easy task. There are many challenges and limitations that can affect the performance and accuracy of the NLP tools. Some of the common challenges and limitations are:
- OCR errors: As noted earlier, OCR text may contain errors, such as misspellings, incorrect characters, missing spaces, etc. These errors can confuse the NLP tools and make them miss or mislabel the entities. For example, if the OCR text contains “J0hn Sm1th” instead of “John Smith”, the NLP tool may not recognize it as a person name. Therefore, it is important to preprocess the OCR text and correct the errors before applying the NLP tools.
- Lack of structure and context: OCR text may also lack structure and context, such as headings, paragraphs, tables, etc. These elements can provide clues and hints for the NLP tools to identify and classify the entities. For example, if the OCR text contains a table with columns labeled as “Name”, “Title”, “Company”, etc., the NLP tool can use these labels to infer the types of the entities in the table. However, if the OCR text does not have any structure or context, the NLP tool may have difficulty in distinguishing the entities. Therefore, it is important to add structure and context to the OCR text and make it more readable and understandable for the NLP tools.
- Variation in quality and format: OCR text may also vary in quality and format, depending on the resolution, contrast, orientation, and language of the original image. These factors affect the accuracy of the OCR output and, in turn, of the NLP tools applied to it. For example, if the original image is blurry, dark, tilted, or in a language the OCR engine was not configured for, the OCR text may be incomplete, noisy, or incomprehensible. Therefore, it is important to normalize the input before OCR where possible (for example, by straightening the image and improving its contrast) and to use OCR and NER models that match the language of the document.
- Limited coverage and diversity: OCR text may also have limited coverage and diversity, depending on the source and domain of the original image. Some sources and domains may have more OCR text available than others, and some may have more diverse and complex types of entities than others. For example, OCR text from books, newspapers, and websites may have more coverage and diversity than OCR text from business cards, receipts, and forms. Therefore, it is important to evaluate the coverage and diversity of the OCR text and the NLP tools and make them suitable and adaptable for different sources and domains.
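For the first challenge in the list above, character confusions such as “J0hn Sm1th” can sometimes be repaired with a targeted rule before running NER. The sketch below is a conservative example under stated assumptions: the digit-to-letter map and the function name `repair_ocr_digits` are hypothetical, and a digit is replaced only when it sits between a letter and a lowercase letter, so that numbers, ZIP codes, and codes such as H1N1 are left alone.

```python
import re

# Hypothetical map of digits that OCR engines may emit in place of letters.
DIGIT_TO_LETTER = {"0": "o", "1": "i", "5": "s"}

# A look-alike digit is treated as an OCR slip only when it follows a letter
# and is followed by a lowercase letter, as in "J0hn" or "Sm1th".
CONFUSED_DIGIT = re.compile(r"(?<=[A-Za-z])[015](?=[a-z])")


def repair_ocr_digits(text):
    """Replace look-alike digits that appear inside otherwise alphabetic words."""
    return CONFUSED_DIGIT.sub(lambda m: DIGIT_TO_LETTER[m.group()], text)


noisy = "J0hn Sm1th, 123 Main Street, New York, NY 10001"
repaired = repair_ocr_digits(noisy)
print(repaired)
```

The rule is intentionally narrow. It cannot fix a confused first character (for example “0scar”), and it always maps 1 to i even when l was intended, so a dictionary or language-model check is a sensible next step for ambiguous cases.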
The next and final section summarizes this blog and suggests some future directions.
7. Conclusion and future directions
In this blog, you have learned how to perform named entity recognition on OCR text using NLP tools. You have learned:
- What is OCR and why is it important?
- What is named entity recognition and how does it work?
- How to perform named entity recognition on OCR text using NLP tools
- Examples of named entity recognition on OCR text
- Challenges and limitations of named entity recognition on OCR text
By following the steps and instructions in this blog, you can use NLP tools to perform named entity recognition on OCR text and extract useful information from scanned documents, images, and videos. You can also use the extracted information for various NLP applications, such as information extraction, text summarization, text classification, etc.
However, this blog is not the end of the story. There are many more things that you can do to improve and extend your skills and knowledge in this topic. Some of the possible future directions are:
- Explore different NLP tools and compare their performance and features for named entity recognition on OCR text. You can also try to combine or customize the NLP tools to suit your needs and preferences.
- Experiment with different types and sources of OCR text and see how they affect the quality and accuracy of the named entity recognition. You can also try to preprocess and enhance the OCR text to improve its readability and usability for the NLP tools.
- Apply named entity recognition on OCR text to different domains and scenarios and see how it can help you solve real-world problems and tasks. You can also try to integrate named entity recognition on OCR text with other NLP techniques and tools to create more advanced and sophisticated solutions.
We hope that this blog has been helpful and informative for you. We also hope that you have enjoyed learning and practicing named entity recognition on OCR text using NLP tools. Thank you for reading and happy learning!