Indico Data receives top position in Everest Group's Intelligent Document Processing (IDP) Insurance PEAK Matrix® 2024
Read More
  Everest Group IDP
             PEAK Matrix® 2022  
Indico Named as Major Contender and Star Performer in Everest Group's PEAK Matrix® for Intelligent Document Processing (IDP)
Access the Report

BLOG

Exploiting Text Embeddings for Industry Contexts

November 14, 2017 | Data Science, Machine Learning

Back to Blog

From Synonyms to Object Properties

It’s well known that word embeddings are excellent for finding similarities between words — specifically, synonyms. We achieve this using supervised machine learning techniques by showing a neural net a dataset of hundreds of millions of pieces of text. The algorithm looks at the context and frequency in which particular words appear together, and maps them out as vectors in multi-dimensional space (or what we otherwise refer to as word embeddings).

A couple of weeks ago, our CTO Slater Victoroff gave a talk on “Using and abusing text embeddings” for the recent AI with the Best conference, where he presented an interesting — and surprising — finding. Not only do word embeddings successfully find synonyms, but also appear to pick up on object properties. For instance, “grass” and “green” seem to map closely to one another, as well as “ocean” and “blue”; “black” and “obsidian”; “sulfur” and “yellow”. These connections are surprising, given the broad range of contexts in the training dataset — we’d expect that the model could only gain this kind of information through rare events like metaphors and analogies. It’s intriguing that the algorithm still managed to learn and represent these relationships, but it’s not clear to us how or why these connections were learned, or how they’re represented in the underlying neural network.

From a utilization perspective, however, we don’t necessarily need to understand that structure. If these general embeddings can actually pick up on object properties or definitions as “weak synonyms” (and it’s not just an odd statistical anomaly), we expect that retraining our embeddings with domain specificity should deliver a similar result for those industry-focused cases. We could then use them to build a more robust fuzzy search tool or chatbots for specific industries.

So, let’s test our hypothesis. As part of his presentation, Slater wrote a script that allows you to play Twenty Questions with your machine using indico’s general text feature embeddings. Let’s tweak it to work with our finance-specific text embeddings. (Or, skip ahead to the Results section.)

Updating the Code

There are only a few pieces of the code that we need to change here. Instead of importing the NOUNS library, we’ll just add a small list of financial terms to the featurize_nouns function.

 finance_nouns = ["accounts payable", "accounts receivable", "accrual basis", "amortization", "arbitrage", "asset", "bankruptcy", "bond", "boom", "capital",
 "cash basis", "certificate of deposit", "commodity", "cost of capital", "cumulative", "debt", "deficit", "depreciation", "dividend", "economy", "equity",
 "exchange traded fund", "fiduciary", "fund", "gross domestic product",  "growth stock", "hedge fund", "internal revenue", "intrinsic", "invest", "invoice",
 "leverage",  "liability", "margin account", "margin call", "money market", "mortgage", "mutual fund", "paycheck", "portfolio", "premium", "profit",
 "real estate", "recession", "return", "revenue", "savings", "short selling", "stock", "trade", "Treasury bill", "treasury stock", "value", "volatility", "wager"
  ]

Underneath that, specify that you want to use indico’s finance feature embeddings by updating vectorize for the features variable.

features = vectorize(finance_nouns, domain="finance", batch_size=200)

You will need to do the same for the answer_features variable in the ask_question function.

answer_features = vectorize(options[int(response)], domain="finance")

Aaaand once again in the closest_difference function.

features = vectorize(list(args), domain="finance")

Now, change all instances of NOUNS across the script to finance_nouns.

You’re almost there! The last thing to change are the keywords you want to use as part of the twenty questions that the program should ask. These are the keywords we used:

twenty_questions([
    ["macro", "micro"],
    ["accounting", "investing", "economics"],
    ["gain", "loss"]
])

Results

When we run the program, we can see that the embeddings do pick up on “object properties” for these more abstract finance domain-specific terms. For instance, selecting “accounting”, “macro”, and “gain” results in the following guesses: internal revenue, asset, accounts receivable, margin account, and revenue.

20qs finance results
That’s pretty good — “internal revenue” and “asset” are both macro accounting terms (the only ones in the list, in fact). “Internal revenue” doesn’t really relate to “gain”, but perhaps the existence of “revenue” in the term moved it closer to that point. “Accounts receivable”, “margin account”, and “revenue” are micro accounting terms, but that’s because we ran out of macro accounting terms to choose from.

Note that our embeddings were trained on quarterly financial reports, which is why they perform better for accounting terminology versus investing and economics concepts. It’s fairly easy to update the embeddings to deal better with investing and economics nomenclature, though. Using Custom Collection and CrowdLabel, you can customize indico’s finance embeddings by retraining them with labeled investment documents to better suit the task at hand.
Once trained, the embeddings can be implemented in fuzzy search tools, FAQs bots, and numerous other applications. If you need help with training or want to learn more, feel free to reach out at contact@indico.io!

[addtoany]

Increase intake capacity. Drive top line revenue growth.

[addtoany]

Unstructured Unlocked podcast

April 10, 2024 | E44

Unstructured Unlocked episode 44 with Tom Wilde, Indico Data CEO, and Robin Merttens, Executive Chairman of InsTech

podcast episode artwork
March 27, 2024 | E43

Unstructured Unlocked episode 43 with Sunil Rao, Chief Executive Officer at Tribble

podcast episode artwork
March 13, 2024 | E42

Unstructured Unlocked episode 42 with Arthur Borden, VP of Digital Business Systems & Architecture for Everest and Alex Taylor, Global Head of Emerging Technology for QBE Ventures

podcast episode artwork

Get started with Indico

Schedule
1-1 Demo

Resources

Blog

Gain insights from experts in automation, data, machine learning, and digital transformation.

Unstructured Unlocked

Enterprise leaders discuss how to unlock value from unstructured data.

YouTube Channel

Check out our YouTube channel to see clips from our podcast and more.
Subscribe to our blog

Get our best content on intelligent automation sent to your inbox weekly!