Building Better Search

July 10, 2017 | Business, Developers, Machine Learning

In speech and writing, how often do we use one term, and only that term, to describe an idea? For example, if you were searching through a document for information relating to a business’s current assets, looking up only “current assets” would mean missing anything that discusses cash, short-term assets, receivables, inventory, or prepaid expenses. Yet too many of our search interactions today are limited to keyword lookups. Some newer techniques augment strict keyword-based approaches by automatically including synonyms from pre-built dictionaries. While this can pay dividends, it is brittle and not as comprehensive as concept-based search. Effective concept-based search, which accounts not just for synonyms but also for context in language, can lead to a very different search experience. Imagine a scenario where a search for “wealth inequality” also draws hits such as “the gap between the rich and the poor”, “unfair distribution of wealth”, “income inequality”, and so forth.
Pure keyword-to-keyword search does not match how people actually speak and write. It limits us, and with today’s deep learning capabilities, it’s a limitation that we can avoid.
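
To make the brittleness of dictionary-based expansion concrete, here is a minimal sketch in Python. The sentences, the synonym dictionary, and the function names are all hypothetical and chosen only for illustration; this is not how any particular search product is implemented.

```python
def keyword_search(query, sentences):
    """Return sentences that contain the exact query phrase (case-insensitive)."""
    q = query.lower()
    return [s for s in sentences if q in s.lower()]


def synonym_search(query, sentences, synonyms):
    """Keyword search expanded with a hand-built synonym dictionary."""
    terms = [query.lower()] + [t.lower() for t in synonyms.get(query.lower(), [])]
    return [s for s in sentences if any(t in s.lower() for t in terms)]


# Hypothetical sentences and dictionary, for illustration only.
SENTENCES = [
    "Current assets rose 4% over the quarter.",
    "Receivables and inventory both grew.",
    "Cash on hand covers three months of payroll.",
    "The board met on Tuesday.",
]
SYNONYMS = {"current assets": ["receivables", "inventory"]}

exact = keyword_search("current assets", SENTENCES)
expanded = synonym_search("current assets", SENTENCES, SYNONYMS)
```

In this toy setup the exact lookup finds only the first sentence and the dictionary version adds the second, but the sentence about cash is still missed because nobody added “cash” to the dictionary.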

Remember when, a few months ago, a judge ordered Scott Pruitt, then the nominee to head the EPA, to release thousands of emails so that the watchdog organization CMD (the Center for Media and Democracy) could inspect them for ties to fossil fuel companies? Nearly 7,000 pages of emails were handed over, and the following day, CMD revealed that Pruitt was indeed friendly with various fossil fuel businesses. Now, we don’t know whether CMD used a keyword search to find the relevant documents, but if that was all they did, they would have had to brainstorm every possible keyword and its variations, and would still have fallen short of complete results.

With a well-built concept search system, entering “fossil fuels” into the search bar should return not only all mentions of “fossil fuels” but also anything related to oil and gas, from fracking plans to oil companies.

For legal aides who spend days poring through hundreds of documents, emails, and other content to discover useful evidence, such a system would save a significant amount of time and improve the quality of what they find. It is applicable to other industries too, from finance to medicine: anything that requires combing through reams of text.

So, how does fuzzy search work?

indico’s Text Features API creates hundreds of thousands of rich feature vector representations for a given text input, learned using deep learning techniques. These feature vectors, numerical representations in multi-dimensional space, are a computer’s way of assigning meaning to language. We can use these representations to calculate the similarity of concepts between sentences and a search query. This is why you can search for a broad concept like “free market” and get results about money, competition, and demand.
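
As a rough illustration of “calculating similarity” between feature vectors, the sketch below ranks sentences by cosine similarity to a query vector. Cosine similarity is a common choice and an assumption here; the post does not say which measure indico uses. The three-dimensional vectors are made-up numbers standing in for real learned features, which would have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, sentence_vecs):
    """Return (sentence, score) pairs, most similar first."""
    scored = [(s, cosine_similarity(query_vec, v)) for s, v in sentence_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors for illustration; a real system would get these from a learned model.
QUERY_VEC = [0.9, 0.1, 0.0]  # stands in for the query "free market"
SENTENCE_VECS = {
    "Prices fall when competition increases.": [0.8, 0.2, 0.1],
    "Demand for natural gas spiked in winter.": [0.6, 0.3, 0.2],
    "See you at the holiday party.": [0.0, 0.1, 0.9],
}

ranking = rank_by_similarity(QUERY_VEC, SENTENCE_VECS)
for sentence, score in ranking:
    print(f"{score:.2f}  {sentence}")
```

Neither of the two highest-ranked sentences shares a word with the query; with these toy vectors they rank highly only because their vectors point in a similar direction, which is the idea behind concept search.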

An experiment

We decided to compare concept vs. keyword search on another public email dataset, the Enron emails, with a simple concept search model built with indico’s Text Features API. Specifically, we explored the 1000+ emails of some randomly chosen users and determined which concepts to search for based on a word cloud analysis of the entire dataset by Zichen Wang. We broke the emails down to the sentence level using the automatic sentence splitting function of indico’s Text Features API. The top results of a concept search for the phrase “effect of economic downturn” were:

Results for query "effect of economic downturn"

These are all excellent examples of concerns about a recession and its impact, and note how our search query is not explicitly stated in any of these sentences. We also found interesting results for human resources:

Results for query "Human Resources"

These results are particularly intriguing as they don’t mention “HR” or a specific task that we would associate with HR, like “careers” or “hiring”, but the sentences chosen by the Text Features API are clear examples of these functions in action.
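
The steps of the experiment (split each email into sentences, turn every sentence into a vector, rank the sentences against a query) could be sketched roughly as follows. Everything here is an assumption made for illustration: the regex splitter is a naive stand-in for the API’s sentence splitting, `toy_embed` is a bag-of-words stand-in rather than a learned model, and the two emails are invented. Because `toy_embed` only counts shared words, this demo still behaves like keyword matching; concept search comes from passing a learned embedding function as `embed` instead.

```python
import heapq
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(text):
    """Stand-in embedding: word counts (NOT a learned feature vector)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(emails, embed):
    """One (email_id, sentence, vector) entry per sentence, so hits can be traced back."""
    return [(eid, s, embed(s)) for eid, body in emails.items() for s in split_sentences(body)]


def search(query, index, embed, k=3):
    """Return the k best (score, email_id, sentence) tuples, best first."""
    q = embed(query)
    scored = ((cosine(q, vec), eid, s) for eid, s, vec in index)
    return heapq.nlargest(k, scored, key=lambda hit: hit[0])


# Invented emails, for illustration only.
EMAILS = {
    "email-1": "The downturn hit our trading volumes. Let's meet Monday.",
    "email-2": "Lunch is at noon. Bring the contract.",
}

index = build_index(EMAILS, toy_embed)
results = search("economic downturn", index, toy_embed, k=2)
```

Indexing at the sentence level, as the experiment does, keeps each hit short enough to read at a glance while the stored email ID still points back to the full message.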

Speaking of HR, while poking around through the database, we noticed an email that clearly indicated some kind of (failed) tryst had taken place between two co-workers. So, we pulled a sentence directly from that email to see if we could pinpoint any other communication that revealed a similar pattern…

Search query: “Nothing has changed that nor do I think we need to act weird around each other going forward.”

Ooh lala.


Note that, of the top three fuzzy search results, only one email contained the term we were searching for. If we consider this on a grander scale, how much information are we missing out on by simply using keyword searches? How many legal cases, business deals, and other decisions may have been affected by incomplete information?

If you’d like to learn more or are looking to implement a machine learning solution for your business, reach out to us at contact@indico.io.
