Let me be clear: I love deep learning. It has radically expanded the scope of problems that machine learning can be practically applied to. I’ve built a company on the back of deep learning and owe quite a lot to it.
DO NOT start with deep learning. I’m not saying that you shouldn’t approach deep learning at all, or even that you shouldn’t end up exclusively studying deep learning. Focusing on deep learning prematurely, though, is in my view a short-sighted decision.
Deep learning isn’t a magic bullet, and in many (common) situations it is in fact a very bad fit for the problems you may be faced with. Precisely because deep learning is such a flexible and powerful tool, it’s important to learn when not to use it. In almost all situations deep learning is capable of attacking a problem, but the set of problems for which deep learning is practically effective or useful is much smaller.
Let’s use an example:
Manager: We’re looking to predict customer churn. We only have about 10,000 customer records, and for each one we have about 10 categorical variables. We want to know if we can use those categorical variables to predict their churn.
Approach 1: Hmm, categorical inputs aren’t a perfect fit for any of the out-of-the-box architectures I’m familiar with, so I’d probably set this up as a simple fully-connected network. That said, we don’t have much data, so I’ll probably need to play around with some smart initialization approaches to get the network into the right neighborhood. I’ll start by getting a basic network up that we can test with, and then we can iterate on the architecture to improve the accuracy. Do we have any GPUs lying around? It would be really helpful to have a couple to run experiments on.
Approach 2: I used a Random Forest from sklearn. It took five minutes, it’s well-fit to the problem definition and I was able to predict churn with 80% precision and 60% recall.
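For concreteness, here’s roughly what Approach 2 boils down to. This is only a sketch: the file path, the column names, and the binary `churned` label are hypothetical stand-ins for whatever your data actually looks like.

```python
# A minimal sketch of Approach 2: one-hot encode the categorical columns
# and fit a Random Forest. File name and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

df = pd.read_csv("customers.csv")  # ~10,000 rows, ~10 categorical features

# One-hot encode the categorical variables; "churned" is assumed to be 0/1.
X = pd.get_dummies(df.drop(columns=["churned"]))
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("precision:", precision_score(y_test, preds))
print("recall:   ", recall_score(y_test, preds))
```

That’s the whole thing: no architecture search, no GPUs, and you have a defensible baseline before lunch.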
Could Approach 1 have worked? Certainly. Would Approach 1 have given the Manager higher accuracy? Maybe, but not likely. This isn’t a particularly complex problem, and it’s not one where we’re likely to exploit combinations of factors any more complex than those a Random Forest already captures. The data is well structured and it’s not very large, so our ability to learn anything truly sophisticated is limited.
Here’s another example: Bag of Words Meets Bags of Popcorn. This tutorial created a bit of a stir when it first came out. (Note: the tutorial is NOT deep learning, despite being billed as such. It is, however, indicative of deep learning solutions in the NLP space.) It was originally intended as a showcase of how powerful word2vec was, a taste of what deep learning could do for NLP. It’s a long, detailed walkthrough of how to apply word vectors to the problem, and a great introduction to the topic for people who aren’t familiar with it.
Except that’s not how you should do the problem. It created a stir because in this problem (as in many others) the more complex solution actually performs worse than the most basic, obvious approach you could think of. You would be shocked at how strong a benchmark logistic regression on top of tf-idf vectors is. It’s a great example of a very simple solution to what may often seem like a very complex task.
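To make that concrete, here’s a minimal sketch of that baseline. I’m assuming the competition’s labeledTrainData.tsv file with a `review` text column and a binary `sentiment` label; adjust the path, columns, and parsing to match your copy of the data.

```python
# A rough sketch of the tf-idf + logistic regression baseline.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Tab-separated labeled training data; quoting=3 ignores stray quote
# characters inside the raw review text.
train = pd.read_csv("labeledTrainData.tsv", sep="\t", quoting=3)

# tf-idf features feeding a plain logistic regression: the "obvious" baseline.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(
    baseline, train["review"], train["sentiment"], cv=5, scoring="roc_auc"
)
print("mean cross-validated AUC:", scores.mean())
```

A handful of lines, no word vectors, no neural network, and it sets a bar that the fancier approach has to actually beat.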
The problem is that if you focus exclusively on deep learning very early on, you will make these mistakes. You’ll miss obvious solutions sitting right in front of your face, you’ll over-complicate solutions when something simple and straightforward would have worked, and worst of all, you probably won’t even know it. You’ll probably end up shipping a very expensive model into production only to learn months or years later that it underperforms a simple benchmark.

You’ll also quickly find yourself out of your depth when you try to move beyond being the ML version of a script kiddie. Deep learning didn’t invalidate the machine learning that came before it; it built on top of it. In the more cutting-edge deep learning papers you’ll find plenty of callbacks to older research, and a lot of assumed fundamental knowledge that you would likely never pick up if you studied deep learning exclusively.