Sorting content into categories is a key task for recommendation systems, as well as for general data management. We’ve talked a lot about text data lately — for example, using topic tags to improve article suggestions for your readership, and how you can build a custom text classifier for your specific industry or task. We didn’t even need large datasets to build those tools — and the same applies to images. Using our customizable machine learning API, Custom Collections, you can easily build a model to auto-tag images, streamlining and improving visual content recommendations and management.
The Task
Build a simple interior design style classifier using Custom Collections. More specifically, we’ll be testing to see how well it can distinguish three fairly similar styles: contemporary, minimalist, and industrial.
The Data
Our dataset consists of 21 images of rooms (7 per style) that I grabbed from Google Images (all labeled for reuse). That’s right, just 21. How is this possible? Custom Collections uses a machine learning technique called transfer learning, which allows us to build models with very small datasets. In fact, depending on the difficulty of your task, you’ll start seeing diminishing returns after the first few thousand data points.
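To see why so few examples can work, here is a minimal sketch of the transfer learning idea. This is illustrative only, not indico's internals: a pretrained network supplies generic feature vectors, and only a small classifier on top of them is trained. The synthetic "embeddings" below stand in for real CNN features, and the sketch assumes numpy and scikit-learn are installed.

```python
# Illustrative sketch of transfer learning (not indico's implementation):
# a frozen pretrained network turns each image into a feature vector, and
# only a small linear classifier is trained, so ~7 samples per class can work.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
STYLES = ["contemporary", "minimalism", "industrial"]

# Stand-in for pretrained-network embeddings: 7 feature vectors per style,
# clustered around a per-style center. (A real pipeline would run each image
# through a frozen CNN to get these vectors.)
centers = rng.normal(size=(3, 64))
X = np.vstack([centers[i] + 0.3 * rng.normal(size=(7, 64)) for i in range(3)])
y = np.repeat(STYLES, 7)

# Only this small classifier is trained -- 21 samples suffice.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Classify a new "image" (a fresh feature vector near the first style's center).
new_image_features = centers[0] + 0.3 * rng.normal(size=64)
print(clf.predict([new_image_features])[0])
```

The point of the sketch: the heavy lifting (the 64-dimensional features) is assumed to come from a model trained on millions of images, so the part we train from scratch is tiny.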
Note, though, that it’s generally better to have 10 or more samples per category (depending on the difficulty of the problem you’re trying to solve). Good free-for-use images are hard to find, so we’ll make do with 7.
Training the Model
If you want to follow along, clone the dataset and skeleton code from the GitHub repo. We’ll be working in Python.
Before we go any further, have you set up your free indico account yet? If you haven’t, follow our Quickstart Guide; it will walk you through getting your API key and installing the indicoio Python library. If you run into any trouble, check the Installation section of the docs, or reach out to us through that little chat bubble.
Assuming your account is all set up and you’ve installed everything, let’s get started.
Step 1: Labeling the Data
If you’re working with the dataset I provided (located in the images folder), you’ll see that each image is named after the style it represents. Open up main.py. In the generate_training_data function, you’ll see that we grab those filenames and use them as the labels for each image. If you decide to use your own unlabeled dataset, you can use our CrowdLabel tool. I’m no design expert, so I may have inaccurately labeled a few of these images. CrowdLabel lets multiple people on your team label the dataset independently, improving labeling accuracy by keeping only the examples that multiple people assigned the same label. Using CrowdLabel also lets you skip all of the code in this tutorial 😛
Step 2: Training Your Collection
The generate_training_data function processes all our data and labels, preparing them to be passed into the Custom Collection API, which takes a list of items, each paired with a single label.
Now we can train our model! It’s actually incredibly easy.
Go to the top of your file and import indicoio. Don’t forget to set your API key. There are a number of ways to do it; I like to put mine in a configuration file.
```python
import indicoio
from indicoio.custom import Collection

indicoio.config.api_key = 'YOUR_API_KEY'
```
Go back down to the bottom of your file and, under if __name__ == "__main__", generate your training data and define your empty Collection. Then just add your data to the Collection and train!
```python
from tqdm import tqdm

if __name__ == "__main__":
    train = generate_training_data()
    collection = Collection("interior_design")
    for sample in tqdm(train):
        collection.add_data(sample)
    collection.train()
    collection.wait()
    collection.info()
```
Just like that. tqdm is a progress bar that shows how much data has been uploaded, and .wait() blocks until training is complete. Since the dataset is so small, training should only take about a minute, depending on how fast your Internet connection is.
Calling collection.info() checks your Collection’s status and returns metrics that are useful indicators of the model’s performance. However, larger training sets are recommended for reliable precision and recall metrics, so we’ll test our model in a more hands-on way instead.
Testing the Model
First, let’s run some test examples through our model to see how it performs on each category. I set aside some images in the test_images folder that weren’t in the training dataset. Comment out the training code under if __name__ == "__main__", then run the following.
```python
if __name__ == "__main__":
    collection = Collection("interior_design_2")
    test_model()
```
Generally, we can take the highest-probability result as the category the model thinks the image most likely belongs to. Your results should look like those below (slight variations in the numbers are normal; they should be roughly the same). In the original post, the test images for each category appear above their results.
Test results for CONTEMPORARY category:
{u'minimalism': 0.1399256475, u'industrial': 0.1386253738, u'contemporary': 0.7214489787}
{u'minimalism': 0.2830540094, u'industrial': 0.0067256700, u'contemporary': 0.7102203205}
{u'minimalism': 0.3341266282, u'industrial': 0.0198866961, u'contemporary': 0.6459866757}

Test results for INDUSTRIAL category:
{u'minimalism': 0.1785425491, u'industrial': 0.5101019466, u'contemporary': 0.3113555043}
{u'minimalism': 0.0526127544, u'industrial': 0.5505277214, u'contemporary': 0.3968595242}
{u'minimalism': 0.3112188771, u'industrial': 0.3916452771, u'contemporary': 0.2971358458}

Test results for MINIMALISM category:
{u'minimalism': 0.9069708701, u'industrial': 0.0880693949, u'contemporary': 0.0049597350}
{u'minimalism': 0.9149341994, u'industrial': 0.0394596009, u'contemporary': 0.0456061997}
{u'minimalism': 0.6548668879, u'industrial': 0.0572945461, u'contemporary': 0.2878385660}
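The "highest probability wins" rule is just an argmax over each result dict. A quick sketch using the numbers printed above:

```python
# Pick the predicted category: the label with the highest probability
# in a result dict like the ones the model returns.
def top_label(result):
    return max(result, key=result.get)

# First CONTEMPORARY test result from above:
result = {u'minimalism': 0.1399256475,
          u'industrial': 0.1386253738,
          u'contemporary': 0.7214489787}
print(top_label(result))  # -> contemporary

# Tally how many of a category's test results were classified correctly:
industrial_results = [
    {u'minimalism': 0.1785425491, u'industrial': 0.5101019466, u'contemporary': 0.3113555043},
    {u'minimalism': 0.0526127544, u'industrial': 0.5505277214, u'contemporary': 0.3968595242},
    {u'minimalism': 0.3112188771, u'industrial': 0.3916452771, u'contemporary': 0.2971358458},
]
accuracy = sum(top_label(r) == u'industrial' for r in industrial_results) / len(industrial_results)
print(accuracy)  # all three industrial examples are classified correctly
```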
Looks like the model did alright! If, however, the model had not performed satisfactorily, we could try adding more examples of the underperforming category to the Collection’s training dataset, and retrain the model.
Next Steps
Where to from here? Try expanding the system by adding more styles, or applying this tutorial to other categories, like clothes, food, or art. Or, go a step further — can you adapt our fashion matching tutorial, which also uses the structure of a classification problem, to build a model that matches pieces of furniture to the style of already existing rooms?