Unstructured data explained: Why skewed data presents a big problem for intelligent document processing

There’s a problem in data science that can have a profoundly negative effect on intelligent document processing models: skewed data.

It’s not unusual for data scientists, intentionally or not, to skew artificial intelligence models toward the direction they want the models to go or the results they hope to get.

In a recent episode of the “Unstructured Explained” video series, Slater Victoroff, Founder and CTO of Indico Data, discussed the skewed data issue and the effect it has on automation models with Brandi Corbello, VP of Business Development at Indico and former VP of Transformation at Cushman & Wakefield.

The “overfitting” issue

“This is actually a really big problem,” Victoroff said, noting that it shows up as “overfitting” in AI models. Overfitting occurs when a model fits its training data too closely, often because that data set is too narrow for the task the model is meant to handle. The model treats quirks of the training data as if they applied to the wider population. It also can’t correct the mistake by broadening its horizons, because nothing in the training data tells it a mistake is being made.

An article on EliteDataScience.com offers an example. Say you want to predict whether a student will land a job interview based on their resume. You train a model on a data set of 10,000 resumes and their outcomes. When applied to that original data set, the model predicts outcomes with 99% accuracy. But when run against a data set of resumes it hasn’t seen yet, the model delivers only 50% accuracy. The gap between those two numbers is the signature of overfitting.
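The gap between accuracy on seen and unseen data can be reproduced with a deliberately extreme toy model. The sketch below is not the EliteDataScience example itself: the synthetic “resumes”, the `MemorizingModel` class and the feature count are all assumptions made for illustration, and the outcomes are coin flips so that there is nothing real to learn.

```python
import random

N_FEATURES = 30  # hypothetical number of yes/no attributes per resume


def make_resumes(n, rng):
    """Build n synthetic resumes: random binary features, coin-flip outcome."""
    rows = []
    for _ in range(n):
        features = tuple(rng.randint(0, 1) for _ in range(N_FEATURES))
        outcome = rng.randint(0, 1)  # unrelated to the features by design
        rows.append((features, outcome))
    return rows


class MemorizingModel:
    """An overfit model in its purest form: it memorizes the training rows."""

    def fit(self, rows):
        self.seen = dict(rows)  # maps each feature tuple to its outcome
        return self

    def predict(self, features):
        # Resumes the model has never seen get a default answer of 0.
        return self.seen.get(features, 0)


def accuracy(model, rows):
    """Fraction of rows whose outcome the model predicts correctly."""
    return sum(model.predict(f) == y for f, y in rows) / len(rows)


rng = random.Random(7)
train_rows = make_resumes(1000, rng)
unseen_rows = make_resumes(1000, rng)

model = MemorizingModel().fit(train_rows)
train_acc = accuracy(model, train_rows)
unseen_acc = accuracy(model, unseen_rows)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on unseen data:   {unseen_acc:.2f}")
```

Because the model has only memorized its training rows, it looks close to perfect on them and falls back to roughly chance on new resumes. Real models overfit less blatantly, but the check is the same: always score on data the model was not trained on.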

Victoroff pointed to a well-known attempt to get around this kind of evaluation. The Chinese search firm Baidu got in hot water in 2015 when it broke the rules of the ImageNet challenge, a benchmark competition that measures how accurately models identify objects in images. Contest rules allowed contestants to test their models against the evaluation data only twice per week, which limited how much they could learn about the test set. Baidu far exceeded that limit, submitting more than 200 tests over six or seven months. The reason: every extra submission returns feedback on the test data, and a team can use that feedback to tune its model to that particular data set rather than to images in general.
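Why a submission limit matters can be shown with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the ImageNet setup: the “models” are random weight vectors (`random_model` is a hypothetical stand-in), the labels are coin flips, and the set sizes are arbitrary. Since no model can genuinely do better than chance here, any score above 50% on the test set comes purely from picking the luckiest of many submissions.

```python
import random

rng = random.Random(0)
N_FEATURES = 15  # odd, so a sum of +/-1 terms is never exactly zero


def make_examples(n):
    """Random +/-1 features with coin-flip labels: nothing real to learn."""
    return [
        ([rng.choice((-1, 1)) for _ in range(N_FEATURES)], rng.randint(0, 1))
        for _ in range(n)
    ]


def random_model():
    """A hypothetical 'model': just a random +/-1 weight vector."""
    return [rng.choice((-1, 1)) for _ in range(N_FEATURES)]


def score(weights, examples):
    """Accuracy of the sign-of-weighted-sum rule on the given examples."""
    hits = 0
    for features, label in examples:
        total = sum(w * x for w, x in zip(weights, features))
        prediction = 1 if total > 0 else 0
        hits += prediction == label
    return hits / len(examples)


test_set = make_examples(200)    # the shared benchmark test set
fresh_set = make_examples(2000)  # data no submission was ever scored on

candidates = [random_model() for _ in range(200)]
test_scores = [score(m, test_set) for m in candidates]

best_of_2 = max(test_scores[:2])  # staying within a two-submission limit
best_of_200 = max(test_scores)    # ignoring the limit
winner = candidates[test_scores.index(best_of_200)]
fresh_score = score(winner, fresh_set)

print(f"best test score after 2 submissions:   {best_of_2:.3f}")
print(f"best test score after 200 submissions: {best_of_200:.3f}")
print(f"same winning model on fresh data:      {fresh_score:.3f}")
```

The best of 200 submissions can never score lower on the test set than the best of the first two, yet the winning model has learned nothing, which the fresh data exposes. Capping submissions limits how much of this selection effect can leak into a leaderboard score.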

Data scientists can also manually tweak models to produce a desired outcome, whether the improvement is real or only apparent. “People do definitely dismiss the impact that setting hyperparameters have on the end model, because it’s super opaque,” Victoroff said. If a scientist changes a parameter “K” from three to four, “that’s not a level of detail that a manager is going to think about.” But such changes can add up and skew results, especially if the change is more dramatic, such as “changing K from three to 17.684 because you know that’s perfect for this dataset,” he said.
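One common guard against this kind of tuning is to choose hyperparameters on a separate validation split and to score the test split only once, at the end. Victoroff does not say which algorithm his “K” belongs to; the sketch below assumes a k-nearest-neighbors classifier on synthetic two-dimensional points, and the data generator, split sizes and candidate values are all illustrative assumptions.

```python
import random
from collections import Counter

rng = random.Random(42)


def make_points(n):
    """Two overlapping classes of 2-D points (synthetic, for illustration)."""
    points = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        points.append(((rng.gauss(center, 1.5), rng.gauss(center, 1.5)), label))
    return points


def knn_predict(train, query, k):
    """Majority label among the k training points closest to the query."""
    def squared_distance(row):
        (x, y), _ = row
        return (x - query[0]) ** 2 + (y - query[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, rows, k):
    """Fraction of rows classified correctly by k-NN fitted on train."""
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


train = make_points(150)
validation = make_points(75)
test = make_points(75)

candidate_ks = [1, 3, 5, 9, 15]  # odd values avoid tied votes with two classes

# Tune K against the validation split only.
val_scores = {k: accuracy(train, validation, k) for k in candidate_ks}
chosen_k = max(val_scores, key=val_scores.get)

# Score the test split once, with K already fixed.
test_accuracy = accuracy(train, test, chosen_k)

print(f"validation accuracy by K: {val_scores}")
print(f"chosen K: {chosen_k}, test accuracy: {test_accuracy:.3f}")
```

Because the test split plays no part in choosing K, its score remains an honest estimate of how the tuned model behaves on new data. Picking K by repeatedly checking the test score would quietly turn the test set into training data.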

Accurate models, no data scientists required

Intentional or not, any attempt to skew data can have an adverse effect on automation models. It can also put companies at risk of violating compliance regulations – for example, if a bank can’t adequately explain why its automation model denied a client’s mortgage application.

The Indico Unstructured Data Platform enables companies to build process automation models that are completely transparent. In large part, that’s because they’re built not by data scientists, but by those who are involved in the process day to day. Using intuitive Indico Data tools, your process experts train models by labeling 200 or so actual documents, including those containing unstructured content.

Under the covers, advanced artificial intelligence technologies then enable the platform to recognize the values or content types in the labeled fields wherever they may appear in a document – not just in an exact spot, as with templated tools. The models are trained to recognize the “idea” of an invoice and all its vagaries, vs. the exact fields on an invoice from a particular vendor.

In that fashion, Indico Data models never fall victim to skewed data. What’s more, with Staggered Loop Training features inherent in the latest Indico 5, models can continually learn from human-in-the-loop input.

To learn more, check out the full episode below.



