in the Enterprise
is leveraged by AI
in resources with AI
relevant information is laborious and error-prone
makes rule-based workflow automation impractical
is needed to understand document context
As companies continually seek to automate as many data processes as they can, they hit a wall when attempting to process unstructured data, such as that found in long form financial documents, claims forms, contracts and emails. The reason is simple: by its very nature, unstructured data varies dramatically from one document to the next.
Historically companies have dealt with unstructured data processing by using the only means at their disposal: having humans read the unstructured documents, find relevant information, and manually enter it into the desired destination system. Such a process is inherently unscalable and error-prone, as humans tend to get tired and bored.
More recently, companies have tried to use rule-based approaches such as OCR templating and/or Robotic Process Automation for unstructured data processing. These often meet with little success because the templates are too brittle to handle the variation inherent in unstructured data. Intelligent intake, which is a form of intelligent document processing, offers a viable solution.
This guide will explain various approaches to unstructured data processing and how to process unstructured documents so you can make an informed decision on which will work best for you.
Among the first tools youâre likely to encounter while researching solutions for automating processes involving unstructured data in documents is optical character recognition, or OCR.
You can use OCR to deal with scanned documents. Scanned documents are effectively images, even if they are saved in PDF format. Images, of course, are not readily machine-readable, so computers canât immediately process what humans can clearly see is text in the document. OCR document processing addresses that issue by identifying text in such documents and converting it to a digitized format that computers can deal with.
In some instances, you may not even need OCR to interpret a PDF document. If you take a Word document, for example, and save it as a PDF, the text information should be automatically preserved in text format, and any system can then process it directly.
Another approach to automating unstructured document processing is to use rule-based tools. In fact, you may have come across the term âOCR templatingâ in your search for document processing solutions. Itâs often confused with vanilla OCR. While it also uses OCR in its workflow (i.e., turning images into readable text), the key difference arises in the term âtemplating.â These vendors offer to produce rule-based templates built from a handful of your documents.
This approach requires humans to mark up a sample document by drawing bounding boxes around the relevant bits of information you want to extract and then label each box with the type of data youâre trying to capture â name, address, material changes, etc.
Youâre basically creating an expert system based on three essential elements:
Topology, meaning where elements appear on the page. For example, a date may always be near the top while a total, as in an invoice, typically appears at the bottom.
Regex, which is a computer programming technique for searching words or information according to letter patterns. For example, you may write a rule to identify Social Security numbers by searching for the pattern: xxx-xx-xxxx. If, however, OCR misidentifies a number or letter (which is often the case), like the number â1â for the letter âlâ, regex rules tend to unravel.
Taxonomy, which is the classification of rules, often following a hierarchy. âNumberâ may be a main category, followed by numerous subcategories, such as âdollar amount,â âsocial security number,â âstreet addressâ and so on.
For example, you could write a rule to pick up the material changes note in an SEC filing like so:
This sort of approach works well if you need to process many documents that all look the same and will remain so â perhaps financial statements for personal bank accounts that all come from the same bank. If all the statements are from the same financial institution and all the data youâre after is in the same place on every single statement, then something like an OCR templating approach may be a viable solution.
But as soon as any variation enters the equation, like statements from a different bank, or variation in language for a longer form of data, such as picking up material changes in SEC filings, humans will need to intervene to create a new or revised template. Itâs easy to see how coming up with all the required templates can quickly become unwieldy â and costly. And for use cases consisting of thousands of documents or more, all unique, rule-based approaches will fail.
Intelligent intake takes a different approach to unstructured document processing. Rather than create templates for each possible variation, intelligent intake tools use technologies such as natural language processing (NLP) that enable them to accurately understand text, tables, and images within the context of any given document.
Intelligent intake solutions do use OCR to âreadâ text so that it can be input to the platform. However, rather than requiring rules that identify each specific element in a document, it uses deep learning to contextually understand what data youâre looking to extract from a given set of documents.
Thereâs no need to create templates for every possible variation of a document you may have to deal with because intelligent intake solutions can take what theyâve learned from one document and apply it to others â a concept known as transfer learning.
For example, if you were building a bot to help analysts quickly review SEC filings, you might want it to automatically highlight revenue increases or decreases to assess how well each company is performing. You could write a rule to look for the word âincrease,â but could it also accommodate synonyms, misspellings, and implied meaning? If the factors that contributed to the increase are found not next to the word âincreaseâ but in a completely different sentence, how do you write rules for that?
Because IDP solutions understand language and context, they can pick up on revenue change information no matter what synonym you use and attribute it to the correct catalysts. In this same vein, OCR misspelling errors that throw off regex rules present little challenge. Humans can easily read misspelled words because the brain automatically recognizes the mistake and corrects for it. NLP models can do much the same thing, enabling them to more accurately interpret scanned images than rule-based systems can. With IDP, template-based solutions for highly variable data become obsolete.
In the example above, from an Apple 10-Q form, how do you write enough rules to show that âincrease,â âfavorable,â and multitudes of synonyms are all related, and that iPhone sales and a strong U.S. dollar were important contributing factors? With this example you can see that rule-based approaches are not automation, they are insanity.
Both rule-based and intelligent automation approaches have their place. Which you choose just depends on your exact application requirements.
To determine which is best for you, thereâs basically one big question to consider;Â What is the nature of the documents youâre dealing with?
Many vendors purport to have intelligent automation solutions that will meet your requirements for unstructured data processing, but not all are selling ârealâ solutions with sound artificial intelligence technology behind them. So youâll need to conduct due diligence for your organization.
Following are some questions to consider as you vet document processing providers:
Whatâs your algorithm strategy?
Some vendors have homegrown algorithms while others, like Indico, use open-source algorithms that are constantly being improved. If a vendor canât explain where its algorithm comes from and what it does, consider that a big red flag.
Whatâs your database strategy?Â
Answers here again will vary. In Indicoâs case, the strategy centers around our base model that comprises more than 500 million data points that enable the intelligent intake platform to understand human language and context. Users then customize the model for their specific application but require 100x to 1000x less data than if starting from scratch.
Whatâs your application strategy?
AI solutions must come with applications that make it accessible to those who actually have to use it, including business people. Indicoâs platform, for example, has a point-and-click user interface that makes it simple for anyone to create working process automation solutions and models in just hours.
As you can see, there are many options for processing unstructured documents. With the advances in machine learning that have been made lately, many industries are increasingly choosing intelligent process automation over the rule-based techniques of the past.