If you happen to know any college-aged kids who are wondering what to do with their lives, or perhaps a colleague who is looking for their next move, here’s a tip: pursue a career in unstructured data analytics.
I’ve been in the data science field for some time now and can tell you the ability to analyze unstructured data is a largely untapped area that is of immense value to businesses. I’m convinced it will be a hot field in the coming years.
Plenty of data analytics and business intelligence (BI) tools exist that enable companies to extract intelligence from structured data, whether it’s in a spreadsheet or a database. You can create all sorts of charts, pivot tables and dashboards using BI apps or data visualization tools like Tableau and Microsoft Power BI. They help you immediately analyze data, spot trends, and otherwise get value from the data you already have.
Or at least, some of the data you have. The problem is, at least 85% of all data in most companies is in an unstructured format, whether PDFs, images, Word documents, email or the like. Sure, you can paste an image into a spreadsheet, but that won’t help you analyze it.
Devising an unstructured data analytics solution
It’s an issue I faced with my previous employer a few years back. We had network drives full of legal documents the company wanted to analyze in hopes of finding trends, commonalities in certain clauses, and executed language that might be considered risky now. We had to roll our own tools using something similar to an ELK stack. ELK stands for three open source projects – Elasticsearch, Logstash and Kibana – that respectively address search and analytics, data processing and transformation, and visualization.
We were successful in building a data lake with query capabilities on top that enabled users to look for a given clause, for example, and find it wherever it existed across all the documents in the data lake. They could then take the resulting data and analyze it using tools like Excel or Word to identify trends or find specific values.
It was powerful enough to enable the company to make about 25,000 contracts searchable. In one instance, we used it to find all contracts that mentioned LIBOR, a soon-to-be-phased-out interest rate that was (and still is) a big deal for financial companies.
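To make the idea concrete, here is a minimal sketch of that kind of search. It is not the stack we built and it does not use Elasticsearch. It is a toy inverted index in plain Python, and the file names and contract text are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return the documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


# Hypothetical sample documents standing in for a contract repository.
contracts = {
    "contract_001.txt": "Interest shall accrue at LIBOR plus 2.5% per annum.",
    "contract_002.txt": "The tenant shall pay rent on the first day of each month.",
    "contract_003.txt": "The floating rate is based on three-month LIBOR.",
}

index = build_index(contracts)
libor_hits = sorted(search(index, "LIBOR"))
print(libor_hits)
```

A real search engine adds relevance ranking, phrase queries and text extraction from PDFs or Word files, but the core lookup works along these lines.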
That was about three years ago. More tools are out there today, and machine learning is playing a big role. Let’s say a commercial real estate firm wants to find all contracts with clauses that are “to the best of the party’s knowledge” or that have to do with contingencies. Such terms may appear anywhere in a contract and the language may not be consistent, so a simple Ctrl + F search won’t always work. The model has to learn the patterns of clauses that contain such terms so it can find the clauses no matter where they appear in a contract, which makes this a great use case for natural language modeling.
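Here is a small sketch of why exact matching falls short. It uses Python’s standard difflib as a crude stand-in for a trained model, which it is not. The sample clauses and the 0.7 cutoff are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

TARGET = "to the best of the party's knowledge"


def words(text):
    """Lowercase the text and keep only words (apostrophes included)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def best_similarity(clause, target=TARGET):
    """Highest similarity between the target phrase and any window of the
    clause that is as many words long as the target."""
    target_words = words(target)
    clause_words = words(clause)
    size = len(target_words)
    target_text = " ".join(target_words)
    best = 0.0
    for start in range(max(1, len(clause_words) - size + 1)):
        window = " ".join(clause_words[start:start + size])
        score = SequenceMatcher(None, target_text, window).ratio()
        best = max(best, score)
    return best


# Hypothetical clauses; the first one rewords the target phrase.
clauses = [
    "To the best of Seller's knowledge, no claims are pending.",
    "Seller represents that, so far as it is aware, no claims are pending.",
    "Rent is due on the first day of each month.",
]

# Exact matching (the Ctrl + F approach) misses the reworded clause.
exact_hits = [c for c in clauses if TARGET in c.lower()]

# Approximate matching catches wording that is close to the target.
fuzzy_hits = [c for c in clauses if best_similarity(c) >= 0.7]
```

Note that the second clause says much the same thing in entirely different words, and character-level similarity misses it too. Closing that gap is where a model that has learned the patterns of such clauses earns its keep.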
What you’re after is some assurance that you’ve found all of the contracts that contain the target clause(s). It’s the same reason people use Google instead of Bing or Yahoo. With Google, the result you’re after is almost always on the first page. If not, you probably need to refine your search.
Similarly, you need a way to identify if your machine learning model is not finding all the contracts it should and refine it accordingly. The Indico Unstructured Data Platform, for example, will give you a score indicating how confident it is that it’s giving you the right result. If it’s only 75% confident, you can go back and refine it until the score gets better.
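As a rough illustration of working with confidence scores, the sketch below splits predictions into ones you accept and ones you send back for review. The Prediction class, the sample values and the 0.9 threshold are all hypothetical. This is not the Indico API, just a generic pattern that applies to any model that reports a confidence.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    document: str      # name of the source document
    label: str         # clause type the model assigned
    confidence: float  # model confidence between 0.0 and 1.0


def triage(predictions, threshold=0.9):
    """Split predictions into accepted and needs-review lists."""
    accepted = [p for p in predictions if p.confidence >= threshold]
    needs_review = [p for p in predictions if p.confidence < threshold]
    return accepted, needs_review


# Hypothetical model output for three contracts.
predictions = [
    Prediction("lease_014.pdf", "knowledge_qualifier", 0.97),
    Prediction("lease_027.pdf", "knowledge_qualifier", 0.75),
    Prediction("lease_033.pdf", "contingency", 0.92),
]

accepted, needs_review = triage(predictions)
for p in needs_review:
    print(f"Review {p.document}: {p.label} at {p.confidence:.0%}")
```

The items that land in the review pile are the ones worth labeling by hand and feeding back in, which is how the scores improve over time.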
Use cases for unstructured data analytics
As for who stands to benefit from analyzing unstructured data, almost any company probably has use cases.
Legal is a big one. Law firms have tons of contracts and stand to benefit if they can analyze them to find ways to favor their clients. E-discovery platforms that help find contracts related to a given subject or company are a start but have been met with only lukewarm adoption. Adding machine learning capabilities to give them customizable intelligence would be a big step forward.
Insurance is another vertical that’s ripe for unstructured document analytics. Older firms, especially, have decades’ worth of policies they can examine to find trends that may shift their actuarial tables a percentage point or two, which can literally mean millions of dollars in savings.
What’s required to make all of these use cases possible is a platform that can understand unstructured data and convert it into a format that traditional data analytics tools can deal with. That’s exactly what the Indico Data platform does. It also enables you to build intelligent automation models, to ease the process of finding relevant data in your sea of unstructured content.
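To show what “a format that traditional data analytics tools can deal with” might look like, here is a deliberately simple sketch that turns free text into rows for a spreadsheet. The policy sentences and the regular expression are invented for illustration, and a regex is only a stand-in here. Real documents vary far too much for a fixed pattern, which is why a learned model is needed.

```python
import csv
import io
import re

# Hypothetical policy text, as it might come out of a PDF or scanned page.
texts = [
    "Policy No. A-1001 issued 2019-03-14 with annual premium $1,250.00.",
    "Policy No. B-2042 issued 2021-11-02 with annual premium $980.50.",
]

# Assumed pattern for this toy data only.
PATTERN = re.compile(
    r"Policy No\. (?P<policy>[A-Z]-\d+) "
    r"issued (?P<issued>\d{4}-\d{2}-\d{2}) "
    r"with annual premium \$(?P<premium>[\d,]+\.\d{2})"
)

rows = []
for text in texts:
    match = PATTERN.search(text)
    if match:
        rows.append({
            "policy": match["policy"],
            "issued": match["issued"],
            # Drop thousands separators so the value is numeric.
            "premium": float(match["premium"].replace(",", "")),
        })

# Write the structured rows as CSV, ready for Excel or a BI tool.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["policy", "issued", "premium"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Once the data is in rows and columns like this, the charts, pivot tables and dashboards mentioned earlier all become available.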
We’re just getting started with the idea of applying machine learning and data analytics capabilities to unstructured data and content sources. As noted up top, I believe it’s going to be a huge area in the years ahead, and tools like the Indico Unstructured Data Platform should be a welcome addition to the effort.
To learn more about how the platform can help you scale your mountain of unstructured documents, check out our interactive demo or schedule a live demo. Or just hit us up with any questions you may have. We’ll be happy to help you get started on the unstructured data analytics journey.