Extract text from pdf – Automate & free up your time

Posted 10-01-2019

What is PDF?

PDF (Portable Document Format) is a file format used to present and exchange documents reliably, independent of software, hardware, or operating system. PDF was invented by Adobe and is now an open standard maintained by ISO. Most software applications can now read and generate PDF files. PDF documents can contain many kinds of content, such as links, form fields, and video, and can be signed electronically.

Typical use cases to extract text from PDF files – Key data extraction

  1. In a document-intensive business, a huge volume of pdf documents needs manual processing for data entry, which demands a large workforce. This slows the business down, adds cost, and introduces manual errors.
  2. Key data elements from pdf documents (e.g., Invoice Number, Date, and Total from an Invoice) need to be extracted and exported to a structured format or system such as Excel or Microsoft SQL Server.
  3. A business needs to build analytics on the extracted data to gain insight into the information currently sitting in pdf files.

Specific use cases to extract text from PDF files

  1. Supplier Invoices.
  2. Purchase Orders.
  3. Insurance Contracts & Claims.
  4. Customer onboarding forms.
  5. Standardized Reports.
  6. Electronic Health Records.
  7. Shipping documents.
  8. Proof of delivery.
  9. and the list goes on and on…

How hard is it to extract text from pdf files?

It depends on the type of pdf file, which can be either searchable or image-based.

  1. Image-based pdf: Scanners normally produce image-based pdf files, where the file contains a scanned image/photo of the actual document. There is no way to copy/paste text from an image-based pdf, so an operator has to read the text and key it into the destination software application by hand. Extracting data from a huge volume of these files can get really time-consuming, messy, and error-prone.
  2. Searchable pdf: Nowadays there are advanced document scanners on the market based on Optical Character Recognition (OCR). These smart scanners extract the actual text from paper documents on the fly during the scan, and the final output is a pdf file whose text can be searched, hence the name “searchable pdf”. In this case, the data entry operator can locate the text and copy & paste it from the pdf into the business application, which is less time-consuming.
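
To make the distinction concrete, here is a minimal sketch in Python of how the two types could be told apart in code. It assumes you already have the text of each page from some PDF library (for example, pypdf can return text per page); the function name and the 20-character threshold are illustrative choices, not part of any product.

```python
def classify_pdf(page_texts, min_chars=20):
    """Label a PDF from the text found on each of its pages.

    page_texts: one string per page, as returned by whatever PDF text
    extractor you use (an assumption of this sketch).
    min_chars: hypothetical threshold below which a page counts as
    having no usable text layer.
    """
    if not page_texts:
        raise ValueError("PDF has no pages")

    # Count the pages that carry a meaningful amount of embedded text.
    pages_with_text = sum(
        1 for text in page_texts if len((text or "").strip()) >= min_chars
    )

    if pages_with_text == len(page_texts):
        return "searchable"
    if pages_with_text == 0:
        return "image-based"  # every page would need OCR
    return "mixed"  # some pages have text, some are scans
```

A file labelled "image-based" or "mixed" would need OCR before any text can be copied or extracted.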

Should I automate extracting text from pdf files?

That depends on the volume, the type (image/searchable), and the amount of text/data you need to process from each pdf file.

Single/Multipage low volume:

To be honest, if we are talking about a few pdf files per day, it’s not a huge challenge to manually extract the data and key it into your line-of-business system. Introducing automation here would be overkill.

Single/Multipage high volume:

In this case, the data entry operator has to open each pdf file individually, locate the data fields on the correct pages, and then copy/paste the data if the pdf is searchable. When the pdf is not searchable, the operator has to type the text into the destination system by hand, which is harder still. Formatting dates and numbers during data entry makes the process even more time-consuming and error-prone.

So at high volume, using modern cloud-based data capture software like DocAcquire to automate the data entry process can yield a significant ROI.
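
The date and number formatting mentioned above is one of the easier parts to script. The sketch below is only an illustration using Python's standard library. It assumes day-first dates and a comma as the thousands separator; the list of formats and the function names are hypothetical and would need adjusting to your own documents.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Date layouts this sketch accepts. Day-first order is an assumption.
DATE_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d", "%d %b %Y", "%d %B %Y")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


def normalize_amount(raw):
    """Return the amount as a Decimal, or None if it cannot be parsed.

    Assumes "," is a thousands separator and "." is the decimal point.
    """
    cleaned = raw.strip().replace(",", "").lstrip("$£€ ")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

Returning None instead of guessing leaves unreadable values for a person to check.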

What tools are available to extract text from pdf files – Full page data extraction?

If you simply want to convert a pdf file to another standard format, you can use the following tools:

  1. Adobe Acrobat
  2. PDF To Text
  3. Online OCR – Allows you to convert PDF to Word, PDF to Excel & PDF to Text
  4. Many more: just Google “convert scanned pdf to text”

I don’t want to extract all the data from pdf files

Your requirement may be to extract only key (specific) data fields from pdf files, for example Invoice Date, Invoice Number, Tax, and Total from a Supplier Invoice. You may also want to store the extracted data in a structured format or system such as Excel, Microsoft SQL Server, Microsoft SharePoint, or your own business system.

If that is the case, you are looking for “automated data capture software”, which is based on Optical Character Recognition (OCR) and Machine Learning.
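
To show what “key data fields” means in practice, here is a deliberately simple sketch that pulls the four invoice fields named above out of plain text with regular expressions. This is not how Machine Learning based capture software works; it is a rule-based stand-in, and the label wording in the patterns is an assumption about how the invoice is laid out.

```python
import re

# One pattern per field. The labels ("Invoice Number", "Tax", ...) are
# assumptions about the wording on the invoice.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(
        r"Invoice\s*Date\s*[:\-]?\s*([0-9]{1,2}[/\-][0-9]{1,2}[/\-][0-9]{2,4})",
        re.IGNORECASE,
    ),
    "tax": re.compile(
        r"\b(?:Tax|VAT)\s*[:\-]?\s*[$£€]?\s*([0-9][0-9,]*\.?[0-9]*)",
        re.IGNORECASE,
    ),
    "total": re.compile(
        r"\bTotal\s*[:\-]?\s*[$£€]?\s*([0-9][0-9,]*\.?[0-9]*)", re.IGNORECASE
    ),
}


def extract_key_fields(text):
    """Return a dict of field name to captured value (None if not found)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields
```

Rules like these break as soon as a supplier words a label differently, which is why trained capture software is used for anything beyond a handful of fixed layouts.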

How does automated data capture software work?

The majority of modern automated data capture platforms are built on a workflow system. Setting up a typical document extraction workflow takes two steps:

  1. Train the software on a set of sample documents so that it extracts only the specific data fields you need.
  2. Configure the workflow to import, classify, extract, verify, and export the extracted data to your chosen destination.

Here are the workflow steps of automated data capture software:

Import:

A variety of file types, such as pdf files and scanned images, can be uploaded to data capture software. These documents can come from an array of sources:

  • Email Inboxes
  • Network Folders
  • Mobile Phones (Document Scanner Apps)
  • Directly uploaded/imported into the software
  • Cloud-based file storage applications like Google Drive, Box, Dropbox, etc.

DocAcquire has a strong integrations engine that enables importing documents from a wide variety of sources.

Classify:

Every document is classified based on its layout and content. Once classified, the document is ready for the next stage, data extraction. If the system can’t recognize the document’s layout and content, it marks the document as Unclassified. In that case, the software needs to be trained on that type of document.
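
As a rough illustration of the idea, the sketch below classifies a document from its text alone by counting keyword hits, and falls back to Unclassified when nothing matches well enough. Real capture software also looks at layout and learns from training documents; the keyword lists and the `min_hits` threshold here are made up for the example.

```python
# Hypothetical keyword lists per document type.
DOCUMENT_KEYWORDS = {
    "invoice": ("invoice number", "invoice date", "amount due"),
    "purchase_order": ("purchase order", "po number", "ship to"),
    "proof_of_delivery": ("proof of delivery", "received by", "delivery date"),
}


def classify_document(text, min_hits=2):
    """Return the best-matching document type, or "Unclassified"."""
    lowered = text.lower()
    best_type, best_hits = "Unclassified", 0
    for doc_type, keywords in DOCUMENT_KEYWORDS.items():
        # Count how many of this type's keywords appear in the text.
        hits = sum(1 for keyword in keywords if keyword in lowered)
        if hits > best_hits:
            best_type, best_hits = doc_type, hits
    # Too few matches means the document is not recognized.
    return best_type if best_hits >= min_hits else "Unclassified"
```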

Extract:

Once the document is classified, the required data fields are extracted based on the initial training. When extraction is done, the document moves to the next stage of the workflow, verification.

Verify:

This is the stage where a human operator verifies the extracted data, fixes any errors, and marks the document as Ready for the next stage, Export. This normally takes little time, as the data extraction software has already done most of the heavy lifting. Having a human verify the extracted data catches errors the software missed and keeps the accuracy of the exported data high.
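
Software can make the operator's job easier by pointing at the fields most likely to need attention. The sketch below is one hypothetical way to do that with simple rules: it flags required fields that are missing and totals that are not valid numbers. The field names and rules are assumptions for illustration, not a description of any product.

```python
from decimal import Decimal, InvalidOperation

# Fields an invoice must have before export (an assumption for this sketch).
REQUIRED_FIELDS = ("invoice_number", "invoice_date", "total")


def review_flags(fields):
    """Return a list of problems the human operator should look at."""
    flags = []
    for name in REQUIRED_FIELDS:
        if not fields.get(name):
            flags.append(f"{name}: missing")

    total = fields.get("total")
    if total:
        try:
            Decimal(total.replace(",", ""))
        except InvalidOperation:
            # Typical OCR slip: the letter O read in place of the digit 0.
            flags.append("total: not a number")
    return flags
```

An empty list means the rules found nothing to flag; the operator still makes the final call.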

Export:

At this stage, the data has been extracted and verified by a human operator, so it is exported to the selected destination.
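
As a small example of an export, the sketch below writes verified invoice fields to CSV text, which Excel can open directly. The column names are hypothetical, and CSV is used here only as a simple stand-in for destinations such as a database or a business application.

```python
import csv
import io

# Columns to export, in order (names are illustrative).
COLUMNS = ("invoice_number", "invoice_date", "tax", "total")


def to_csv_text(documents):
    """Turn a list of field dicts into CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for doc in documents:
        # Keep only the export columns; missing fields become empty cells.
        writer.writerow({col: doc.get(col, "") for col in COLUMNS})
    return buffer.getvalue()
```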

Different approaches to extract data from PDF

There are many ways to extract data from a PDF file, and the right approach depends entirely on your use case or requirement:

  • Full Page Data Extraction
  • Extract Specific/Key Data

Full Page Data Extraction

If you are after the full text of every page of a PDF, this approach is simple and straightforward.

Here are a few reasons why you might want to use this approach:

  • Your PDF is image-based, which means it was generated by a scanner and there is no way to search the text inside it. Once you get the raw text from OCR, you can generate a new, searchable PDF document with the extracted text.
  • You want to find insights and relationships in text extracted from unstructured PDF documents like Letters, Medical Transcriptions, Product Catalogs, etc. To extract these insights, you would probably use an NLP (Natural Language Processing) engine, either one you have developed specifically for your domain or an off-the-shelf service like AWS Comprehend. You feed the raw extracted text to the NLP engine to get the desired results.

Extract Specific/Key Data

If you want to extract specific data points from a PDF document, then that is a different ball game. The complexity increases, especially when the documents are very unstructured, like contracts, letters, etc.

Extract Transactional Data

What is transactional data? When a company does business with another organization and exchanges products or services, the data captured during that process is called transactional data.

For example, when a product is sold, the company that sells it raises an invoice. The invoice contains the required financial data: which product(s) were sold, how many, and for how much. Other transactional documents include Purchase Orders, Proof of Delivery, Bill of Lading, etc.

Why is it hard to extract key/transactional data from a document?

Documents that place individual data points (like invoice number, invoice date, totals, etc.) in fixed positions are less complicated to process than documents where a particular data point (like invoice number) is not fixed to one location. It gets more complex when the PDF is a poor-quality or skewed scan and the data point can appear on any page of the document. It gets trickier still when the data point has to be extracted from dense, unstructured text.
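
The “any page” part of the problem can be sketched as a search over the text of every page rather than one fixed position. The example below assumes the page text has already been produced by OCR or a PDF library, and that the value directly follows its label; both are simplifications, and the function name is hypothetical.

```python
import re


def find_field_on_any_page(page_texts, label_pattern):
    """Search every page for a label and return (page_number, value).

    label_pattern: a regular expression for the label, for example
    r"Invoice\s*(?:Number|No\.?)". Returns None if no page has it.
    """
    # The value is taken to be the first run of non-space characters
    # after the label and an optional ":" or "-".
    pattern = re.compile(label_pattern + r"\s*[:\-]?\s*(\S+)", re.IGNORECASE)
    for page_number, text in enumerate(page_texts, start=1):
        match = pattern.search(text)
        if match:
            return page_number, match.group(1)
    return None
```

This handles a label that moves between pages, but not skewed scans or values buried in running text, which is where the harder cases described above begin.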

How DocAcquire extracts text from pdf files

DocAcquire is a modern cloud-based data capture software that can extract data from a variety of file formats like pdf, png, jpeg, and tiff. By default, DocAcquire uses the OCR engine called AWS Textract to read the text from documents.

DocAcquire Key features for pdf data extraction

  1. If your pdf documents are of poor quality (scanned) or you need to read handwritten text from them, you can easily configure DocAcquire to use Google Vision. Google Vision is a powerful image analysis service based on Machine Learning and delivers extremely high accuracy when it comes to extracting text from pdf and other file formats.
  2. Supports multi-page documents.
  3. Support for extracting table data.
  4. A seamless user interface to verify the extracted data from multiple pages.
  5. Send extracted data to any business application, on-premises or in the cloud, using the standard REST API endpoint.
  6. You can even pull out the extracted data using REST API endpoints in JSON format.
  7. You can also extract your data directly to Microsoft SQL Server.

DocAcquire’s Cognitive Invoice is a platform built on deep learning which makes invoice data extraction a breeze. Here is a quick sneak peek of the platform, where you can see how easy it is to get up and running. There’s no need to build and maintain templates.

I hope you found this blog helpful. If you have any specific questions, please Contact Us and we would be more than happy to answer them.
