in the Enterprise
is leveraged by AI
in resources with AI
relevant information is laborious and error-prone
makes rule-based workflow automation impractical
is needed to understand document context
As companies continually seek to automate as many data processes as they can, they hit a wall when attempting to process unstructured data, such as that found in long form financial documents, claims forms, contracts and emails. The reason is simple: by its very nature, unstructured data varies dramatically from one document to the next.
Historically companies have dealt with unstructured data processing by using the only means at their disposal: having humans read the unstructured documents, find relevant information, and manually enter it into the desired destination system. Such a process is inherently unscalable and error-prone, as humans tend to get tired and bored.
More recently, companies have tried to use rule-based approaches such as OCR templating and/or Robotic Process Automation for unstructured data processing. These often meet with little success because the templates are too brittle to handle the variation inherent in unstructured data. Intelligent document processing automation software, on the other hand, offers a viable solution.
This guide will explain various approaches to unstructured data processing and how to process unstructured documents so you can make an informed decision on which will work best for you.
Among the first tools you’re likely to encounter while researching solutions for automating processes involving unstructured data in documents is optical character recognition, or OCR.
You can use OCR to deal with scanned documents. Scanned documents are effectively images, even if they are saved in PDF format. Images, of course, are not readily machine-readable, so computers can’t immediately process what humans can clearly see is text in the document. OCR document processing addresses that issue by identifying text in such documents and converting it to a digitized format that computers can deal with.
In some instances, you may not even need OCR to interpret a PDF document. If you take a Word document, for example, and save it as a PDF, the text information should be automatically preserved in text format, and any system can then process it directly.
Caution! Beware OCR Templating…
You may have come across the term “OCR templating” in your search for document processing solutions. It’s often confused with vanilla OCR. While it also uses OCR in its workflow (i.e., turning images into readable text), the key difference arises in the term “templating.” This means that these vendors offer to produce rule-based templates built from a handful of your documents.
Another approach to automating unstructured document processing is to use rule-based tools.
This approach requires humans to mark up a sample document by drawing bounding boxes around the relevant bits of information you want to extract and then to label each box with the type of data you’re trying to capture – name, address, material changes, etc.
You’re basically creating an expert system based on three essential elements:
Topology, meaning where elements appear on the page. For example, a date may always be near the top while a total, as in an invoice, typically appears at the bottom.
Regex, which is a computer programming technique for searching words or information according to letter patterns. For example, you may write a rule to identify Social Security numbers by searching for the pattern: xxx-xx-xxxx. If, however, OCR misidentifies a number or letter (which is often the case), like the number “1” for the letter “l”, regex rules tend to unravel.
Taxonomy, which is the classification of rules, often following a hierarchy. “Number” may be a main category, followed by numerous subcategories, such as “dollar amount,” “social security number,” “street address” and so on.
For example, you could write a rule to pick up the material changes note in an SEC filing like so:
This sort of approach works well if you need to process many documents that all look the same and will remain so – perhaps financial statements for personal bank accounts that all come from the same bank. If all the statements are from the same financial institution and all the data you’re after is in the same place on every single statement, then something like an OCR templating approach may be a viable solution.
But as soon as any variation enters the equation, like statements from a different bank, or variation in language for a longer form of data, such as picking up material changes in SEC filings, humans will need to intervene to create a new or revised template. It’s easy to see how coming up with all the required templates can quickly become unwieldy – and costly. And for use cases consisting of thousands of documents or more, all unique, rule-based approaches will fail.
Intelligent process automation (IPA) takes a different approach to unstructured document processing. Rather than create templates for each possible variation, IPA tools use technologies such as natural language processing (NLP) that enable them to accurately understand text, tables and images within the context of any given document.
IPA tools do use OCR, to “read” text so that it can be input to the IPA platform. However, rather than requiring rules that identify each specific element in a document, IPA uses deep learning to contextually understand what data you’re looking to extract from a given set of documents.
There’s no need to create templates for every possible variation of a document you may have to deal with, because IPA tools can take what they’ve learned from one document and apply it to others – a concept known as transfer learning.
For example, if you were building a bot to help analysts quickly review SEC filings, you might want it to automatically highlight revenue increases or decreases to assess how well each company is performing. You could write a rule to look for the word “increase,” but could it also accommodate synonyms, misspellings, and implied meaning? If the factors that contributed to the increase are found not next to the word “increase,” but in a completely different sentence, how do you write rules for that?
Because IPA understands language and context, it can pick up on revenue change information no matter what synonym you use, and attribute it to the correct catalysts. In this same vein, OCR misspelling errors that throw off regex rules present little challenge to IPA. Humans can easily read misspelled words because the brain automatically recognizes the mistake and corrects for it. NLP models can do much the same thing, enabling them to more accurately interpret scanned images than rule-based systems can. With IPA, template-based solutions for highly variable data become obsolete.
In the example above, from an Apple 10-Q form, how do you write enough rules to show that “increase,” “favorable,” and multitudes of synonyms are all related, and that iPhone sales and a strong U.S. dollar were important contributing factors? With this example you can see that rule-based approaches are not automation, they are insanity.
Both rule-based and intelligent automation approaches have their place. Which you choose just depends on your exact application requirements.
To determine which is best for you, there’s basically one big question to consider; What is the nature of the documents you’re dealing with?
Many vendors purport to have intelligent automation solutions that will meet your requirements for unstructured data processing, but not all are selling “real” solutions with sound artificial intelligence technology behind them. So you’ll need to conduct due diligence for your organization.
Following are some questions to consider as you vet document processing providers:
What’s your algorithm strategy?
Some vendors have homegrown algorithms while others, like Indico, use open-source algorithms that are constantly being improved. If a vendor can’t explain where its algorithm comes from and what it does, consider that a big red flag.
What’s your database strategy?
Answers here again will vary. In Indico’s case, the strategy centers around our base model that comprises more than 500 million data points that enable the IPA platform to understand human language and context. Users then customize the model for their specific application but require 100x to 1000x less data than if starting from scratch.
What’s your application strategy?
AI solutions must come with applications that make it accessible to those who actually have to use it, including business people. Indico’s platform, for example, has a point-and-click user interface that makes it simple for anyone to create working process automation solutions and models in just hours.
As you can see, there are many options for processing unstructured documents. With the advances in machine learning that have been made lately, many industries are increasingly choosing intelligent process automation over the rule-based techniques of the past.