Moving from Traditional to Cognitive OCR

Posted 23-10-2020

Cognitive OCR

Cognitive OCR is a more advanced way to identify and extract data from documents. Artificial Intelligence aims to recreate human intelligence, here by enabling a machine to read a document the way a human would. The idea is to imitate human reading more closely than earlier approaches did.

In the past, data extraction was mostly done manually, with a separate team of employees in the company handling data entry. The manual process is slow and does not scale. It also wastes resources by having people repeat the same work again and again. Invoices, for instance, have characteristics that few other documents share. They present numeric data in a variety of ways, and fields such as the client, supplier, and invoice number usually appear at common positions in the document, which makes them easy for humans to identify and understand.

Template Based OCR

Manual entry soon gave way to Template Based OCR, where rules are set up to extract a specific set of data. OCR combined with such a set of rules can be quite accurate and helps in recognizing data within images and documents. The move from Manual Data Entry to Template Based Data Extraction solved a lot of problems and worked well for documents that share a similar structure. Accuracy suffers, however, when documents follow a different structure.

The approach breaks down when a company deals with many vendors and each of them follows a different data format or structure. A separate template has to be created for each vendor, which is time-consuming and expensive, and extraction becomes inaccurate whenever a document does not match its template. This made people look into Cognitive OCR as a viable option for Data Capture.

Extracting Data in Template-Based OCR

Every field extracted from a document in template-based OCR is configured separately. Extraction works by reading fixed coordinates on the document. This works as long as the document structure does not change, e.g. for invoices that always use the same layout. If the structure of an invoice deviates from what the template defines, the template has to be redefined.
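As a rough illustration of coordinate-based extraction, the sketch below reads a field from a fixed rectangle on the page. The `Word` and `Zone` types, the coordinates, and the sample invoice number are all hypothetical; a real OCR engine would supply the word positions.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page (grows downward)


@dataclass
class Zone:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def extract_zone(words, zone):
    """Join every word whose position falls inside the zone, in reading order."""
    hits = sorted((w for w in words if zone.contains(w)), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in hits)


# Hypothetical template: one fixed zone per field.
TEMPLATE = {"invoice_number": Zone(left=400, top=40, right=560, bottom=60)}

page = [Word("Invoice", 300, 50), Word("INV-1042", 420, 50), Word("Total", 300, 500)]
# The same invoice with everything moved 40 units down, e.g. after a larger logo was added.
shifted = [Word(w.text, w.x, w.y + 40) for w in page]

print(extract_zone(page, TEMPLATE["invoice_number"]))           # INV-1042
print(repr(extract_zone(shifted, TEMPLATE["invoice_number"])))  # ''
```

The second call returns an empty string because the value no longer sits inside the configured zone, which is the situation that forces a template to be redefined.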

Extracting Amount Due
Following the rule: 

  1. Search for Keyword: Net Amount Due 
  2. Set the Position to Below the keyword 
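The two steps above could be sketched roughly as follows. This is only an illustration: the `Line` type, the `x_tolerance` parameter, and the sample invoice values are assumptions, not part of any particular OCR product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    text: str
    x: float  # left edge of the text line
    y: float  # top edge of the text line (grows downward)


def extract_below(lines, keyword: str, x_tolerance: float = 30.0) -> Optional[str]:
    """Step 1: find the keyword. Step 2: return the nearest aligned line below it."""
    anchor = next((l for l in lines if keyword.lower() in l.text.lower()), None)
    if anchor is None:
        return None
    below = [
        l for l in lines
        if l.y > anchor.y and abs(l.x - anchor.x) <= x_tolerance
    ]
    if not below:
        return None
    return min(below, key=lambda l: l.y).text


invoice = [
    Line("Invoice No", 50, 40),
    Line("INV-1042", 50, 60),
    Line("Net Amount Due", 400, 500),
    Line("1,250.00", 400, 520),
]

print(extract_below(invoice, "Net Amount Due"))  # 1,250.00
```

The rule only knows "keyword, then whatever is below it", so it returns whatever text happens to sit under the keyword, whether or not that text is an amount.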

Applying this set of rules to invoices that follow the same structure will successfully extract the Net Amount Due field. The result can, however, vary if the structure or format of the document changes even slightly.

In that case, the same rules extract a false result from the invoice, and a person needs to redefine the template.

Redefined Template Rules: 

  1. Search for Keyword: Net Amount Due 
  2. Set the Position to Below the keyword 
  3. Ignore Currency: 
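One way the added third rule might work is as a filter over the text found below the keyword. The sketch below is an assumption about its behavior: the list of currency markers and the function name `amount_ignoring_currency` are hypothetical, since the rule above does not spell them out.

```python
import re
from decimal import Decimal
from typing import Iterable, Optional

# Assumed currency markers; the rule itself does not list them.
CURRENCY = re.compile(r"\b(?:USD|EUR|GBP|INR)\b|[$€£₹]", re.IGNORECASE)


def amount_ignoring_currency(candidates: Iterable[str]) -> Optional[Decimal]:
    """Return the first candidate that is a plain number once currency markers are removed."""
    for raw in candidates:
        cleaned = CURRENCY.sub("", raw).replace(",", "").strip()
        if re.fullmatch(r"\d+(?:\.\d+)?", cleaned):
            return Decimal(cleaned)
    return None


# Texts found below "Net Amount Due", top to bottom, on three hypothetical layouts.
print(amount_ignoring_currency(["1,250.00"]))         # 1250.00
print(amount_ignoring_currency(["USD", "1,250.00"]))  # 1250.00
print(amount_ignoring_currency(["$ 1,250.00"]))       # 1250.00
```

Each new layout variation tends to need another rule of this kind, which is why template maintenance grows with the number of vendors.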

This is where Template-Based OCR shows its limitation. Handling invoices from many vendors becomes a problem because each vendor requires its own template definition.
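To make the per-vendor cost concrete, here is a minimal sketch of how such templates might be stored. The `FieldRule` type, the vendor names, and the `rules_for` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    keyword: str
    position: str  # where the value sits relative to the keyword, e.g. "below"
    ignore_currency: bool = False


# One hand-written template per vendor (hypothetical vendors).
TEMPLATES = {
    "vendor_a": {"amount_due": FieldRule("Net Amount Due", "below")},
    "vendor_b": {"amount_due": FieldRule("Net Amount Due", "below", ignore_currency=True)},
}


def rules_for(vendor: str):
    """Look up a vendor's template; a vendor without one cannot be processed."""
    try:
        return TEMPLATES[vendor]
    except KeyError:
        raise LookupError(
            f"No template defined for {vendor!r}; one has to be written by hand"
        ) from None


print(rules_for("vendor_b")["amount_due"].ignore_currency)  # True
```

Calling `rules_for("vendor_c")` raises `LookupError`, mirroring the point that every new vendor means another template to define and maintain.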

Limitations faced by Template-Based OCR:

  1. Template Definition is time-consuming 
  2. Data extracted is accurate as long as the document follows the same structure 
  3. This process is not scalable
  4. Data Extracted through Template Based OCR needs re-verification
  5. This extraction does not work for unstructured documents

What is Cognitive Data Capture?

Cognitive Data Capture is a newer way of extracting data that uses AI and OCR together. It works well for companies that handle unstructured documents, where data follows no uniform structure. The AI is meant to mimic how the human mind reads a document, and it refines its extraction as it is exposed to more document and invoice structures. The process is also less manual. Apart from being more accurate, Cognitive Data Capture requires less manpower, which saves time and cost and reduces errors.

Extracting Data Using Cognitive OCR
Cognitive OCR, together with AI, is trained with thousands of rules and looks at an invoice the way a human would. It identifies the Total Amount Due in a single pass, without template-based rules having to be defined for it. Cognitive Data Capture is much more scalable and works on documents irrespective of the structure they follow. This approach, however, requires building a highly trained AI, which needs thousands of data sets to approach 100% accuracy.

At DocAcquire, we use AI to recognize the data patterns found in today's documents and invoices. It reads invoices and documents much as a human mind would in order to understand and extract their data. DocAcquire already covers a vast range of document layouts while maintaining accuracy.

Want to get your hands on DocAcquire? Sign up for a free trial today!
