Why do machine learning algorithms require large amounts of computer space for all kinds of datasets?

It depends a bit on which model type you’re talking about, but for the purposes of illustration I’m going to assume that we’re talking about something like a logistic regression model on top of a tf-idf vector.
In this situation, as in many others, model size is more or less independent of dataset size. Here you need to store about one parameter per term in your tf-idf vector. Depending on exactly what’s in your vector (bigrams, trigrams, etc.) you can assume that you’ve got about 10k entries in your tf-idf vector, largely independent of dataset size. Whether your dataset is 1,000 examples or 100,000 examples, your tf-idf vector is going to be about the same size, so you need to store the same amount of data.
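As a rough illustration of that point, here is a minimal sketch in plain Python. It assumes the vocabulary is capped at a fixed number of terms; the `build_vocab` and `make_corpus` helpers, the `MAX_FEATURES` cap, and the synthetic corpus are all hypothetical and only there to show that the parameter count follows the vocabulary cap rather than the number of documents.

```python
from collections import Counter
import random

def build_vocab(docs, max_features):
    """Keep the max_features most frequent terms (hypothetical helper)."""
    counts = Counter(term for doc in docs for term in doc.split())
    return [term for term, _ in counts.most_common(max_features)]

def make_corpus(n_docs, seed=0):
    """Toy corpus: random 20-word documents over a 500-word synthetic vocabulary."""
    rng = random.Random(seed)
    words = [f"w{i}" for i in range(500)]
    return [" ".join(rng.choices(words, k=20)) for _ in range(n_docs)]

MAX_FEATURES = 100  # assumed cap for the toy example; the text assumes ~10k

for n_docs in (1_000, 10_000):
    vocab = build_vocab(make_corpus(n_docs), MAX_FEATURES)
    n_params = len(vocab) + 1  # one weight per term, plus a bias
    print(n_docs, "docs ->", n_params, "parameters")
```

Both corpus sizes end up with the same number of parameters, because the cap, not the corpus, decides how long the feature vector is.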

Now, in this case the model is quite small. Assuming that you’re storing each parameter in eight bytes (which is probably overkill), 10k parameters comes to roughly 80 KB, and even a vocabulary ten or a hundred times larger only puts you in the hundreds of KB to low MB. That is small in absolute terms, but it could still be pretty large relative to the size of your dataset.
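The arithmetic behind that estimate is simple enough to sketch. The helper names below are made up for illustration, and the larger vocabulary sizes are assumptions added to show how the footprint scales.

```python
def model_size_bytes(n_params, bytes_per_param=8):
    """Storage for the parameters alone, ignoring any file-format overhead."""
    return n_params * bytes_per_param

def human(n_bytes):
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    for unit in ("B", "KB", "MB", "GB"):
        if n_bytes < 1000 or unit == "GB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000

# 10k terms is the figure from the text; the larger sizes are illustrative.
for n_terms in (10_000, 100_000, 1_000_000):
    print(n_terms, "terms ->", human(model_size_bytes(n_terms)))
```

At eight bytes per parameter, 10k terms is about 80 KB, and you need on the order of a hundred thousand to a million terms before you reach the high-KB to low-MB range.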

Now, that’s a logistic regression model, which, generally speaking, is very parameter-efficient. Deep learning models tend to sit at the opposite end of the spectrum. If we look at a modern NLP problem (The Stanford Natural Language Processing Group) we’ll see that many of the high-performing solutions have tens of millions of parameters. Using the same basic assumption as above of eight bytes per parameter, we’ve got a model that’s going to be in the hundreds of MB.
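To see how a parameter count in the tens of millions comes about, here is a small sketch that counts parameters for a hypothetical text model: an embedding table, one LSTM layer, and an output projection over the vocabulary. The architecture and every size in it are assumptions picked for illustration, not a description of any particular published model, and the LSTM count uses the common formulation with one bias vector per gate.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, hidden_dim):
    # Four gates, each with input weights, recurrent weights, and one bias vector.
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def dense_params(in_dim, out_dim):
    return in_dim * out_dim + out_dim

# Hypothetical sizes, chosen only for illustration.
VOCAB, EMBED_DIM, HIDDEN = 50_000, 300, 1024

total = (
    embedding_params(VOCAB, EMBED_DIM)
    + lstm_params(EMBED_DIM, HIDDEN)
    + dense_params(HIDDEN, VOCAB)
)
size_mb = total * 8 / 1e6  # eight bytes per parameter, as assumed in the text

print(f"{total:,} parameters, about {size_mb:.0f} MB")
```

Most of the parameters in this sketch sit in the embedding table and the output projection, both of which scale with vocabulary size rather than with the number of training examples.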

That’s a pretty normal range for a modern deep learning model. Something in the realm of hundreds of MB to low GB probably covers the large majority (roughly 80%) of modern models unless particular effort has been taken to reduce the number of parameters and thus the model size (note: this is a very active area of research, with distillation being one well-known approach).

The important thing to note, though, is that the size of your model is generally independent of the size of your dataset. As learning progresses you get better parameter values and the exact contents of your model change, but because you are not changing the structure of the model itself (again, this is a generality, not a hard and fast rule), your model size doesn’t vary and stays wherever it started.
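A minimal sketch of that idea, using a hand-rolled logistic regression trained with plain stochastic gradient descent on synthetic data (the data generator, learning rate, and epoch count are all arbitrary assumptions): training on 100 examples or 5,000 examples changes the values of the weights, but not how many there are.

```python
import math
import random

def make_data(n_examples, n_features, seed):
    """Synthetic data: the label depends only on the sign of the first feature."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_examples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        data.append((x, 1.0 if x[0] > 0 else 0.0))
    return data

def train_logreg(data, n_features, epochs=5, lr=0.1):
    weights = [0.0] * (n_features + 1)  # last slot is the bias
    for _ in range(epochs):
        for x, y in data:
            z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for i, xi in enumerate(x):
                weights[i] -= lr * err * xi
            weights[-1] -= lr * err
    return weights

N_FEATURES = 10
small = train_logreg(make_data(100, N_FEATURES, seed=1), N_FEATURES)
large = train_logreg(make_data(5_000, N_FEATURES, seed=2), N_FEATURES)

print(len(small), len(large))  # same number of parameters
print(small[0], large[0])      # different learned values
```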

Now, “large amount of computer space” is relative. Typically the couple of GB that a hefty deep learning model takes shouldn’t be significant in terms of storage. A dataset that is a couple of GB would be considered quite small in most contexts, and any modern computer is able to hold dozens to hundreds of these models, which is far beyond what is needed in most cases.

The real issue is run-time memory consumption. During training you’re managing much more than just the static model footprint on disk: you have to keep a lot of intermediate state in memory (gradients, optimizer state, activations), which can easily cause your memory footprint to increase several-fold, leading to run-time footprints that can easily be upward of 4GB depending on the architectures you’re working with. This is a problem because many GPUs only have 4GB of onboard memory. You’ll run into it pretty frequently, especially in the language domain, so investing in a GPU with 8+GB of memory is highly advised, especially if you’re using TensorFlow, which is very greedy when it comes to memory footprint.
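A back-of-the-envelope estimate shows where the several-fold increase comes from. The sketch below assumes 32-bit parameters and an Adam-style optimizer that keeps two extra values per parameter; the parameter count and the activation figure are placeholders, since activation memory depends heavily on batch size, sequence length, and architecture.

```python
def training_memory_bytes(n_params, bytes_per_param=4, optimizer_slots=2,
                          activation_bytes=0):
    """Rough training footprint: weights + gradients + optimizer state + activations.

    optimizer_slots=2 assumes an Adam-style optimizer (two moment estimates
    per parameter); plain SGD without momentum would use 0.
    """
    weights = n_params * bytes_per_param
    gradients = n_params * bytes_per_param
    optimizer_state = optimizer_slots * n_params * bytes_per_param
    return weights + gradients + optimizer_state + activation_bytes

N_PARAMS = 70_000_000          # hypothetical model size
ACTIVATIONS = 2 * 10**9        # placeholder: 2 GB of activations, an assumption

static = N_PARAMS * 4
training = training_memory_bytes(N_PARAMS, activation_bytes=ACTIVATIONS)

print(f"weights only: {static / 1e9:.2f} GB")
print(f"training:     {training / 1e9:.2f} GB")
```

Under these assumptions the weights alone take a fraction of a GB, while the training footprint lands above 3 GB, which is why a 4GB card fills up quickly.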

View original question on Quora >

Follow Slater on Quora >>
