There’s a problem in data science that can have a profoundly negative effect on intelligent document processing models: skewed data.
It’s not exactly unusual for data scientists, whether intentionally or not, to skew artificial intelligence models based on their own thoughts about the direction they want the models to go, or the results they’re seeking to get.
In a recent episode of the “Unstructured Explained” video series, Slater Victoroff, Founder and CTO of Indico Data, discussed the skewed data issue and the effect it has on automation models with Brandi Corbello, VP of Business Development at Indico and former VP of Transformation at Cushman & Wakefield.
The “overfitting” issue
“This is actually a really big problem,” Victoroff said, noting that it often manifests as “overfitting” in AI models. Overfitting occurs when a model is trained on a dataset that is too narrow for the task the model is meant to handle. The model incorrectly assumes that the attributes of that dataset apply to the wider whole, and it can’t correct the mistake by broadening its horizons because it doesn’t know a mistake is being made.
An article on EliteDataScience.com offers an example. Say you want to predict whether a student will land a job interview based on their resume, using a model trained on a dataset of 10,000 resumes and their outcomes. When applied to the original dataset, the model predicts outcomes with 99% accuracy. But when run against a dataset of resumes it hasn’t seen yet, the model delivers only 50% accuracy – no better than a coin flip.
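The train-versus-test gap described above is easy to reproduce. Here is a minimal, hypothetical sketch (not Indico’s or EliteDataScience’s code) using a high-degree polynomial fit in NumPy: a model with one parameter per training point scores nearly perfectly on the data it was fit to, yet badly on points it hasn’t seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# True relationship is simply y = x, plus a little noise.
x_train = np.linspace(0.0, 1.0, 10)
y_train = x_train + rng.normal(0.0, 0.1, size=10)
x_test = np.linspace(0.05, 0.95, 10)       # unseen points between the training x's
y_test = x_test + rng.normal(0.0, 0.1, size=10)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (given by coeffs) on data (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple_fit = np.polyfit(x_train, y_train, 1)   # 2 parameters: matches the true trend
complex_fit = np.polyfit(x_train, y_train, 9)  # 10 parameters: one per training point

print("simple : train", mse(simple_fit, x_train, y_train),
      "test", mse(simple_fit, x_test, y_test))
print("complex: train", mse(complex_fit, x_train, y_train),
      "test", mse(complex_fit, x_test, y_test))
```

The degree-9 polynomial memorizes the training noise (near-zero training error) and oscillates wildly between the training points, so its test error balloons – the same pattern as the 99%-train / 50%-test resume model.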
In an attempt to avoid such a fate, Victoroff noted, the Chinese search firm Baidu got in hot water a few years ago when it broke the rules of the ImageNet contest, which pits competing models against one another to see which is most adept at identifying images. Contest rules allowed contestants to test their models against the benchmark only twice per week, limiting how much they could learn about the test data. Baidu far exceeded that limit, submitting more than 200 tests over six or seven months. The reason: the company knew that seeing more of the test images would help make its model score higher.
Data scientists can also manually tweak models in order to produce a desired outcome, whether intentionally or not. “People do definitely dismiss the impact that setting hyperparameters has on the end model, because it’s super opaque,” Victoroff said. If a scientist changes a parameter “K” from three to four, “that’s not a level of detail that a manager is going to think about.” But such changes can add up to skew results, especially if the change is more dramatic, such as “changing K from three to 17.684 because you know that’s perfect for this dataset,” he said.
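To see how an innocuous-looking change to “K” can flip a result, here is a hypothetical sketch of a k-nearest-neighbors classifier (the kind of model where K is the signature hyperparameter; the toy data and function names are invented for illustration). The same query point gets a different label depending solely on the K the data scientist picked.

```python
from collections import Counter

def knn_predict(points, labels, query, k):
    """Classify `query` by majority vote among its k nearest neighbors (1-D)."""
    nearest = sorted(range(len(points)), key=lambda i: abs(points[i] - query))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy 1-D data: class "A" clustered low, class "B" clustered high.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 9.5]
labels = ["A", "A", "A", "B", "B", "B", "B"]

# Query 3.0 sits near the "A" cluster...
print(knn_predict(points, labels, 3.0, k=3))  # its 3 nearest neighbors are all "A"
print(knn_predict(points, labels, 3.0, k=7))  # with k=7, the 4 "B" votes win
```

Neither K is “wrong” in isolation, which is exactly the opacity Victoroff describes: a reviewer glancing at the pipeline would not see that the choice of K, not the data, determined the answer.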
Accurate models, no data scientists required
Intentional or not, any attempt to skew data can have an adverse effect on automation models. It can also put companies in jeopardy of running afoul of compliance regulations in some instances – such as if a bank can’t adequately explain why its automation model denied a mortgage application from a given client.
The Indico Unstructured Data Platform enables companies to build process automation models that are completely transparent. In large part, that’s because they’re built not by data scientists, but by those who are involved in the process day to day. Using intuitive Indico Data tools, your process experts train models by labeling 200 or so actual documents, including those containing unstructured content.
Under the covers, advanced artificial intelligence technologies then enable the platform to recognize the values or content types in the labeled fields wherever they may appear in a document – not just in an exact spot, as with templated tools. The models are trained to recognize the “idea” of an invoice and all its vagaries, vs. the exact fields on an invoice from a particular vendor.
In that fashion, Indico Data models never fall victim to skewed data. What’s more, with the Staggered Loop Training features built into the latest Indico 5, models can continually learn from human-in-the-loop input.
To learn more, check out the full episode below.