
Building Better Search

July 10, 2017 | Business, Developers, Machine Learning

In speech and writing, how often do we use one term, and only that term, to describe an idea? For example, if you were searching through a document for information relating to a business’s current assets, looking up only “current assets” would mean that you miss anything discussing cash, short-term assets, receivables, inventory, and prepaid expenses. Yet in too many of our search interactions today, searching for information is limited to keyword lookups. Some newer techniques augment strict keyword-based approaches by automatically including synonyms from pre-built dictionaries. While this can pay dividends, the approach can be brittle and isn’t as comprehensive as concept-based search. Effective concept-based search, which accounts not just for synonyms but also for context in language, can lead to a very different search experience. Imagine a scenario where a search for “wealth inequality” also draws hits such as “the gap between the rich and the poor”, “unfair distribution of wealth”, “income inequality”, and so forth.
Pure keyword-to-keyword search is unintuitive to human speech and expression. It limits us, and with today’s deep learning capabilities, it’s a limitation that we can avoid.
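To make the limitation concrete, here is a minimal sketch of strict keyword lookup next to dictionary-based synonym expansion. The documents, the `SYNONYMS` table, and both function names are hypothetical and exist only for illustration; they are not part of any indico product.

```python
def keyword_search(query, documents):
    """Return documents that contain the query as a case-insensitive substring."""
    q = query.lower()
    return [doc for doc in documents if q in doc.lower()]


def expanded_search(query, documents, synonyms):
    """Keyword search that also matches any term listed for the query
    in a hand-built synonym dictionary."""
    q = query.lower()
    terms = [q] + [term.lower() for term in synonyms.get(q, [])]
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]


# Hypothetical synonym dictionary, based on the "current assets" example above.
SYNONYMS = {
    "current assets": [
        "cash", "short-term assets", "receivables", "inventory", "prepaid expenses",
    ],
}

# Hypothetical documents.
DOCS = [
    "Current assets rose 4% over the quarter.",
    "Receivables and inventory both grew.",
    "Funds that can be turned into money within a year increased.",
]

strict_hits = keyword_search("current assets", DOCS)
expanded_hits = expanded_search("current assets", DOCS, SYNONYMS)
```

In this toy setup the strict search finds only the first document, and the expanded search adds the second. The third document describes the same idea in words the dictionary does not list, so both approaches miss it. That gap is what concept-based search is meant to close.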

Remember when, a few months ago, a judge ordered the then-nominee for the EPA, Scott Pruitt, to release thousands of emails so that the CMD watchdog organization could inspect them for ties to fossil fuel companies? Nearly 7,000 pages of emails were handed over, and the following day, CMD revealed that Pruitt was indeed friendly with various fossil fuel businesses. Now, we don’t know whether CMD used a keyword search to find the relevant documents, but if that was all they did, they would have had to brainstorm every possible keyword and its variations, and would still have fallen short of the full set of relevant results.

With a well-built concept search system, entering “fossil fuels” into the search bar should return not only all mentions of “fossil fuels”, but also anything related to oil and gas, from fracking plans to oil companies.

For legal aides who spend days poring through hundreds of documents, emails, and other content to discover useful evidence, such a system would save a significant amount of time and improve the quality of what they find. It is applicable to other industries too, from finance to medicine: anything that requires combing through reams of text.

So, how does fuzzy search work?

indico’s Text Features API creates hundreds of thousands of rich feature vector representations for a given text input, learned using deep learning techniques. These feature vectors (numerical representations in multi-dimensional space) are a computer’s way of assigning meaning to language. We can use these representations to calculate the similarity of concepts between sentences and a search query. This is why you can search a broad concept like “free market” and get results about money, competition, and demand.
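As a rough illustration of the similarity step, the sketch below ranks sentences by cosine similarity between their vectors and a query vector. Cosine similarity is one common choice and is an assumption here; the post does not say which measure indico uses. The three-dimensional vectors are hand-made stand-ins, not real Text Features API output, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, sentence_vecs, top_k=3):
    """Return the top_k (score, sentence) pairs, most similar to the query first."""
    scored = [
        (cosine_similarity(query_vec, vec), sentence)
        for sentence, vec in sentence_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors. A real system would obtain these from a
# text-embedding model and they would have far more dimensions.
TOY_VECTORS = {
    "Prices rise when demand outstrips supply.": [0.9, 0.1, 0.0],
    "Competition between firms keeps margins thin.": [0.8, 0.3, 0.1],
    "The meeting has moved to Thursday.": [0.0, 0.2, 0.9],
}
QUERY_VECTOR = [1.0, 0.2, 0.0]  # stand-in for the query "free market"

results = rank_by_similarity(QUERY_VECTOR, TOY_VECTORS, top_k=2)
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

With these toy vectors, the two economics sentences rank above the scheduling one even though neither contains the words “free market”. The ranking depends only on where the vectors point, not on shared keywords.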

An experiment

We decided to compare concept vs. keyword search on another public email dataset, the Enron emails, with a simple concept search model built with indico’s Text Features API. Specifically, we explored the 1,000+ emails of some randomly chosen users and determined which concepts to search for based on a word cloud analysis of the entire dataset from Zichen Wang. We broke the emails down to the sentence level using the Text Features API’s automatic sentence splitting function. The top results of a concept search for the phrase “effect of economic downturn” were:

Results for query "effect of economic downturn"

These are all excellent examples of concerns about, and the impact of, a recession. Note how our search query is not explicitly stated in any of these sentences. We also found interesting results for human resources:

Results for query "Human Resources"

These results are particularly intriguing as they don’t mention “HR” or a specific task that we would associate with HR, like “careers” or “hiring”, but the sentences chosen by the Text Features API are clear examples of these functions in action.

Speaking of HR, while poking around through the database, we noticed an email that clearly indicated some kind of (failed) tryst had taken place between two co-workers. So, we pulled a sentence directly from that email to see if we could pinpoint any other communication that revealed a similar pattern…

Search query: “Nothing has changed that nor do I think we need to act weird around each other going forward.”

Ooh lala.


Note how, out of the top three fuzzy search results, only one email contained the term we were searching for. If we consider this on a grander scale, how much information are we missing by relying on keyword searches alone? How many legal cases, business deals, and other decisions may have been affected by incomplete information?

If you’d like to learn more or are looking to implement a machine learning solution for your business, reach out to us at contact@indico.io.
