
Is Euclidean distance meaningful for high dimensional data?

July 10, 2018 | Ask Slater


The short answer is no. In high dimensions, Euclidean distance loses pretty much all meaning.

However, this isn’t the fault of Euclidean distance in particular (though some distance metrics do hold up better in high dimensions than Euclidean does).

The main issue is something commonly referred to as the “Curse of Dimensionality”. It’s very unintuitive, but also a common and insidious issue that will plague anything you do in a high-dimensional space.

Let’s be clear though. By “high-d” we’re talking hundreds to thousands of dimensions for a dense vector (sparse vectors are a completely different topic). Basically, once you get up to high dimensionality, the pairwise distances between all of your points approach a constant. Not zero, not infinity, but a constant.
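To make that concrete, here’s a minimal simulation sketch (assuming numpy and scipy; the original answer contains no code) showing the spread of pairwise distances collapsing relative to their mean as dimensionality grows:

```python
# A minimal sketch of distance concentration: as dimensionality grows,
# pairwise Euclidean distances cluster ever more tightly around one value.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000, 10000):
    X = rng.standard_normal((500, d))  # 500 random dense vectors in d dims
    dists = pdist(X)                   # all unique pairwise Euclidean distances
    # std/mean -> 0 means every pair is roughly the same distance apart
    print(f"d={d:>5}  mean={dists.mean():8.2f}  std/mean={dists.std() / dists.mean():.4f}")
```

For standard normal data the mean pairwise distance grows like √(2d) while its standard deviation stays roughly constant, so the ratio heads toward zero: every point ends up about the same distance from every other point.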

Now, there are several important caveats here, and quite frankly the curse of dimensionality isn’t something that we understand very well outside of toy examples.

First – this pattern starts to fall away if your different dimensions are correlated. If you can do a PCA or something similar to re-project into a lower-d space with a small amount of loss, then your distance metrics are probably still meaningful, though this varies case by case.
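As a rough way to check this (a sketch assuming scikit-learn; the data, projection method, and component count are illustrative):

```python
# A sketch: 1000-d data that secretly lives near a 20-d subspace, i.e. the
# dimensions are highly correlated. If a low-d projection keeps almost all
# the variance, distances computed in that projection stay meaningful.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 20))
mixing = rng.standard_normal((20, 1000))
X = latent @ mixing + 0.01 * rng.standard_normal((500, 1000))

pca = PCA(n_components=20).fit(X)
print(f"variance explained by 20 components: {pca.explained_variance_ratio_.sum():.4f}")
X_low = pca.transform(X)  # 500 x 20; distances here lose very little structure
```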

Second – this isn’t something as easy as “just use this other distance metric”. The critical problem here is sparsity, and the value of any distance metric at high-d. In a k-NN scenario it’s usually still the case that the relative distances between points have meaning, just that the absolute distances have much less of it. A lot of modern manifold layout algorithms attempt to circumvent this problem by throwing out the distances and instead only considering narrow “neighborhoods” of nearest neighbors, though many approximate nearest neighbor solutions (such as Barnes-Hut) become very ineffective at high-d. This is largely because the assumptions around the efficacy of linear subdivision of the underlying space fall away.
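Here’s what that neighborhood-only idea looks like in practice (a sketch assuming scikit-learn’s NearestNeighbors; manifold layout methods build on a neighbor graph like this rather than on raw pairwise distances):

```python
# A sketch of the "neighborhoods only" idea: keep each point's k nearest
# neighbors and discard all other pairwise distances.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 1000))  # 500 points in a 1000-d space

nn = NearestNeighbors(n_neighbors=15).fit(X)
dists, idx = nn.kneighbors(X)  # each row: a point's 15 nearest neighbors
# (the first "neighbor" of each point is the point itself, at distance 0)
# Everything outside idx[i] is ignored rather than compared by raw distance.
print(idx[0], dists[0].round(2))
```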

To address the second point, there are interesting techniques like Voronoi clustering that help mitigate some of these issues.
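The answer doesn’t spell the technique out, but one common reading of Voronoi clustering here is the partitioning used in inverted-file ANN indexes: k-means centroids carve the space into Voronoi cells, and a query is compared only against points in its nearest cell. A hedged sketch, assuming scikit-learn’s KMeans:

```python
# A hedged sketch of Voronoi-cell search: k-means centroids partition the
# space into Voronoi cells; a query is compared only within its own cell.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 256))

km = KMeans(n_clusters=32, n_init=4, random_state=0).fit(X)
cells = km.labels_  # Voronoi cell assignment for every point

query = rng.standard_normal(256)
cell = km.predict(query[None, :])[0]  # nearest centroid = query's cell
members = np.where(cells == cell)[0]  # candidate points in that cell only

# Exact distances computed only within the cell, not over all 2000 points
d = np.linalg.norm(X[members] - query, axis=1)
print("nearest in cell:", members[d.argmin()], "distance:", round(float(d.min()), 3))
```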

In general it depends a lot on the use case, but if you’re using Euclidean distance in a space that has hundreds or thousands of independent variables, you should get very paranoid about your assumptions very quickly.

View original question on Quora >

Follow Slater on Quora >>

Effective January 1, 2020, Indico will be deprecating all public APIs and sunsetting our Pay as You Go Plan.

Why are we deprecating these APIs?

Over the past two years, our new product offering, Indico IPA, has gained a lot of traction. We’ve successfully enabled some of the world’s largest enterprises to automate their unstructured workflows with our award-winning technology. As we continue to build and support Indico IPA, we’ve concluded that, in order to provide the quality of service and product we strive for, the platform requires our utmost attention. As such, we will be focusing on the Indico IPA product offering.

