Hello, friendly human! Welcome to the first in a series of articles about using machine learning to solve problems. We start with fundamental concepts, and later explain how to implement those concepts using Python + TensorFlow code. Then we’ll show how to combine and extend these fundamental concepts to solve more interesting problems.
An unproductive pattern
It is now easy to find tutorials with TensorFlow code. But we’ve found that many smart people who are learning TensorFlow are also diving deep into machine learning for the first time. This can lead to a common (unproductive) pattern where you think you are stuck on an implementation detail, when you are actually stuck on the thinking process:
- Search for “TensorFlow tutorial”. Load a few from the front page into tabs.
- Follow a simple tutorial. Great, it worked easily. I’m using deep neural networks!
- Try to solve a new problem using something very complicated, like a seq2seq LSTM RNN with input embedding, attention, sampled softmax, bucketing, etc.
- Hmm, something isn’t working, and these errors are pretty hard to decipher.
- Search again. Maybe someone else has exactly solved my problem already…
Clear thinking process >> writing code
If there is a secret to using machine learning like a pro, here it is: the important part of machine learning isn’t writing code, it is the scientific thinking process of translating your unique problem into a task that can be solved using data + machine learning + tools (like TensorFlow) + information available on the internet. Let’s get started!
This article introduces a common set of machine learning fundamentals, which can be used to understand most machine learning solutions. We’re not presenting a grand unifying opus of machine learning theory here, just a set of patterns that can help you think more productively about machine learning. These patterns are the basis for many kinds of models, from simple linear regression to deep neural networks.
Here we start simple, introducing the fundamental concepts and the machine learning workflow. We’ll show how the machine learning thinking process is very similar to the scientific thinking process. In Part 2 we’ll put these concepts to work, using a very simple machine learning model to solve the task of optical character recognition. Later articles will build upon these concepts, from implementing a basic neural network to explaining how various improvements throughout the 1990s and 2000s made it possible to train the “deep” neural networks that are so successful today.
Who should read?
- You have a good working knowledge of Python. You like to read tutorials and work through the steps on your own. You understand the basic idea of machine learning—it is a way to program a machine to do things, using data instead of hard-coded rules. These are the only hard requirements. If you don’t know Python yet, don’t cheat yourself. Never before have machine learning tools been more accessible or more powerful, so if you have any interest in developing skill with machine learning, do it the right way.
- “Help! I searched for `TensorFlow tutorial` and found pages of results. Which one is the best to learn from?” The best way to learn is to implement the stuff you read and satisfy your curiosity with experiments. Unfortunately, most tutorials out there today are side effects of the author’s own learning process, and are not necessarily optimized for your learning. We’re writing these articles for readers who want to understand the fundamentals of machine learning and see how to implement those concepts simply, using TensorFlow.
- “I saw that Google uses a bidirectional LSTM sequence-to-sequence neural translation model using sampled softmax and bucketing in the seq2seq tutorial. It sounds impressive so it must be the best! I’ll start there!” Haha, this is actually how I started with TensorFlow. Be smarter than me! Even armed with the advantages of good working knowledge of machine learning, access to helpful colleagues/community, and familiarity with the tasks/datasets, it would have been better to start simple.
- You have already read some tutorials and followed along, but you find yourself struggling to apply the concepts when you try to solve something more novel. Fear not, this is also a symptom of starting with the complicated stuff. The cure is a dose of the fundamentals.
Who should *not* read? (or read away, but it might not be what you’re looking for ;))
- PhD students. If you are already a machine learning researcher with a good understanding of the fundamentals and methods of machine learning, and you are already familiar with Torch or Theano, check out Alec’s and Nathan’s repos first:
- The original barebones intro to machine learning fundamentals, with code in Theano. Alec made the repo to accompany a talk at Tufts University:
- Nathan ported it to TensorFlow, and added some more advanced modules like an autoencoder and LSTM:
- Seasoned machine learning hackers. Hopefully you know most of this already. But sometimes it is nice to read a refresher, so don’t let experience stop you.
- Statisticians. If you expect a certain level of statistical rigor to be discussed and justified here, sorry, that isn’t going to happen. Nobody really understands why some ML methods, like neural networks, work as well as they do. But then, nobody really understands why mathematics or statistics work so well, either.
- True beginners. If you want a gentler introduction to the methods than the quick, practical, hacker version presented here, you might enjoy the Udacity course by Vincent Vanhoucke. The course is great, and as one of the TensorFlow tech leads at Google Brain, Dr. Vanhoucke is an authoritative instructor.
Why isn’t there a repo for this?
Because we programmers are lazy, and learning really sinks in when we are forced to explore and struggle a bit. Inefficiencies are *good* for learning. All the code is freely available here, but you’ll have to think through things and construct the pieces yourself. If we put up a repo, you will clone it, thereby short-circuiting your learning process. Besides, there are good repos for this already (see above).
Recommend a good machine learning reference book?
You can’t go wrong with the Deep Learning book by Goodfellow, Bengio, and Courville. The online version is currently free.
- Read about TensorFlow in the official docs. If you see a word or concept that seems strange, check the docs.
- (optional) If you want to use GPU acceleration, download and install CUDA and cuDNN from NVIDIA.
- Install TensorFlow, using the official docs. Linux is recommended, but OS X also works. The `pip` install usually works, and is faster than installing `bazel` and compiling from source.
- Run this simple test. There’s no point in going further until you have a working TensorFlow installation. If you can’t get the GPU version working, you can always `pip` install the CPU version for now, and update to the GPU version later.
What is TensorFlow, and why is it significant?
TensorFlow is the open source implementation of Google’s second-generation machine learning system. There are many reasons why people are excited about TensorFlow. As practical problem solvers, we care about this tool in particular because it helps us do experiments faster. The following features of TensorFlow make fast experimentation possible:
- Symbolic differentiation capability means we can write Python code that is very much like the mathematical expressions in research publications, and TensorFlow can compile it to executable (numerical) code. Researchers (and the articles we write) use symbolic math because it is usually easier to think that way. But translating the symbolic math into good numerical implementations can be difficult, and there simply aren’t enough researchers who can also write good numerical implementations. It’s a totally different skill set. So being able to write symbolic code and compile it to efficient numerical code is both a time saver and an enabler for the scientific community. It is worth noting that we’ve had symbolic capability for a while now using tools like Theano, so TensorFlow isn’t offering a unique feature here. But thanks to the popularity of TensorFlow and neural networks, it is probably the first time many users are encountering the feature.
- Compile to CPU or GPU targets. Training machine learning models is very computationally expensive. Using a tool like TensorFlow, we can exploit the power of GPUs to do many simple mathematical operations, very quickly, in parallel. Speedups of 30X+ are typical. That is the difference between waiting a day on a GPU and waiting a month on a CPU. But in order to take advantage of GPUs, you need to feed the GPU with instructions that exploit parallelism, and this can be very tricky to implement, especially if you have already started down a solution path.
TensorFlow can compile Python code (very quickly) to both CPU and GPU targets, with only minor code changes required to switch between the two. Plus, TensorFlow now has a very robust interface for multi-GPU support.
- Constraints. TensorFlow implements a number of constraints and abstractions that guide our thinking about a problem and how to prepare data to solve it. This helps us to think of solutions that exploit parallelism, modularity, etc. Constraints inspire creativity; not to be underestimated!
- Model lifecycle management can be a major pain. How do we reload a model from a saved checkpoint and resume training? How do we deploy a model to a remote server? How do we visualize a model architecture to confirm we have implemented what we intended? How do we stream data to/from multiple sources? TensorFlow has features to address each of these pain points, many of which are only revealed when maintaining multiple models over time.
- Python has always had a very helpful community, and folks have rapidly embraced TensorFlow. If you don’t have the advantage of helpful + knowledgeable colleagues, you can find answers using Github, StackOverflow, Reddit, and more. If you are stuck for more than ~30 minutes, ask a friend or find some help on the internet.
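To make the symbolic differentiation bullet above a little more concrete, here is a toy sketch in plain Python. It is only an illustration of the idea: the class names (`Expr`, `Var`, `Const`, `Add`, `Mul`) are hypothetical, and this is not how TensorFlow is implemented or what its API looks like. We build an expression tree for a squared-error loss, then ask the tree for its derivative with respect to a parameter.

```python
# Toy symbolic differentiation. All names here are hypothetical and chosen
# for illustration; none of this is TensorFlow's API.

class Expr:
    """Base class: the operators build an expression tree instead of computing."""
    def __add__(self, other):
        return Add(self, wrap(other))
    def __radd__(self, other):
        return Add(wrap(other), self)
    def __mul__(self, other):
        return Mul(self, wrap(other))
    def __rmul__(self, other):
        return Mul(wrap(other), self)

def wrap(value):
    """Turn plain numbers into Const nodes so they can join the tree."""
    return value if isinstance(value, Expr) else Const(float(value))

class Const(Expr):
    def __init__(self, value):
        self.value = value
    def eval(self, env):
        return self.value
    def diff(self, var):
        return Const(0.0)  # the derivative of a constant is zero

class Var(Expr):
    def __init__(self, name):
        self.name = name
    def eval(self, env):
        return env[self.name]
    def diff(self, var):
        return Const(1.0 if self.name == var.name else 0.0)

class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)
    def diff(self, var):
        # sum rule: (f + g)' = f' + g'
        return Add(self.left.diff(var), self.right.diff(var))

class Mul(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, env):
        return self.left.eval(env) * self.right.eval(env)
    def diff(self, var):
        # product rule: (f * g)' = f' * g + f * g'
        return Add(Mul(self.left.diff(var), self.right),
                   Mul(self.left, self.right.diff(var)))

# Write the math symbolically: loss = (w * x + b - y)^2
w, x, b, y = Var("w"), Var("x"), Var("b"), Var("y")
error = w * x + b + (-1.0) * y
loss = error * error

# Only now plug in numbers and evaluate.
env = {"w": 2.0, "x": 3.0, "b": 1.0, "y": 5.0}
loss_value = loss.eval(env)
dloss_dw = loss.diff(w).eval(env)
dloss_db = loss.diff(b).eval(env)
print(loss_value, dloss_dw, dloss_db)
```

The point of the sketch is the workflow: we write the expression once, much like the math on paper, and the derivatives come for free. A real tool also simplifies the resulting expressions and compiles them to fast numerical code, which this toy does not attempt.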
Fundamental parts of the machine learning process
Learning by example
The whole point of machine learning is to save yourself the effort of discovering and implementing rules. As humans, we learn by example. We observe some example, form a mental model (i.e., a hypothesis) for how the thing works, and observe the effect. When an effect contradicts our mental model, we update our mental model to become more consistent with observations. We can memorize previous examples, and make predictions about examples we have never seen before by generalizing our mental model.
A scientific thinking process.
Compared to human sensory input, your computer program has very limited ways to “observe” things, but the modern workflow/interface for machine learning is otherwise very similar. Machine learning systems learn by example, too. Instead of sensory input, a model reads examples from a data stream, and updates its internal state to be consistent with the input. A model can memorize previous examples, and generalize to make predictions about examples it has not seen.
The modern machine learning process is very similar to the scientific process.
(Differences shown in purple).
Exploiting the technical advantages of machines
Computing machines have some undeniable advantages: speed, precision, reproducibility, the ability to share exact copies, tireless iteration, the economics of storage and computing power, etc. So when we look at learning as a process to optimize, it makes sense to think about how we can exploit these advantages.
Specifically, let’s start by being more explicit about the evaluation step. Given an input example, the model outputs a prediction. Then we evaluate the prediction to know how the model should be updated.
We might do it intuitively with our own mental models,
but we should be explicit about the machine learning process.
Loss value is feedback for learning
Computers are good at iterating through lists of examples, but how will we make sure the updates from the next iteration are consistent with the updates from this one? One simple way is to write an “evaluate” function to yield a “loss” value. This function (usually called the “loss function” or “cost function”) compares the known answer with the predicted answer to compute a normalized score. Now we can apply the same function to every input example. As a bonus, we can normalize it to improve the dynamics of the learning process, typically using something like the log transform.
The concepts of loss function and loss value are simple and important. The loss value is the key feedback into the machine learning process, and it defines what objective your model will optimize. Without a loss value, your model would have no way to know which parameters to update, or how to update them. There are a number of reasonable loss functions, and you must use something that makes sense in the context of the problem you are trying to solve.
Split the “evaluate” and “update” into separate methods and define a “loss” value.
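As a minimal sketch of the “evaluate” idea, here are two common loss functions and a helper that applies the same loss to every example, in plain Python. The function names and the toy data are our own choices for illustration, not part of any library. `squared_error` suits numeric predictions; `log_loss` is one example of the log transform mentioned above, applied to a predicted probability for a 0/1 answer.

```python
import math

def squared_error(predicted, target):
    """Loss for numeric predictions: zero when exact, growing with the error."""
    return (predicted - target) ** 2

def log_loss(predicted_prob, target, eps=1e-12):
    """Loss for a predicted probability of a binary (0 or 1) target."""
    # Clamp the probability so log() never sees exactly 0 or 1.
    p = min(max(predicted_prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def evaluate(predict, examples, loss_fn):
    """Apply the same loss function to every (input, known answer) pair."""
    losses = [loss_fn(predict(x), y) for x, y in examples]
    return sum(losses) / len(losses)

# Toy data where the known answer is always twice the input.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def good_model(x):
    return 2.0 * x

def bad_model(x):
    return x

print(evaluate(good_model, examples, squared_error))  # no error at all
print(evaluate(bad_model, examples, squared_error))   # clearly worse
```

Because `evaluate` takes the loss function as an argument, trying a different loss is a one-word change, which is the consistent interface this section is describing.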
Optimizers update parameters
The final piece of the machine learning process is a method for using the loss value to update the model with new, hopefully better, parameters. For decades, many smart people have invested effort into solving the optimization problem: “given a sequence of loss values, what is the best way to update parameters?” A perfect optimizer would take one input, then perfectly guess the best parameters. But in the real world, optimizers aim to iteratively converge on parameters that produce better (lower) loss values. After many iterations without much improvement, the optimizer has “converged” and terminates. For some situations, you might want an optimizer that converges very quickly to an approximately correct set of parameters. For other situations, you might want to find the best parameters, no matter how many iterations it takes. Optimization is a very deep and technical topic; we’ll explore it in greater depth in a future article.
Fortunately, because we have factored out the loss values, optimizers are implemented with a very modular interface. This is especially true of first-order iterative optimizers like gradient descent, which are the current best choice for neural networks trained with backpropagation. Thus, in practical use, switching from one optimizer to another is usually as easy as changing the function name and maybe modifying a few hyperparameters (e.g., learning rate).
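Here is a rough sketch of what that modular interface can look like, in plain Python. The class names and the `step` method are hypothetical, chosen for illustration rather than copied from TensorFlow. Both optimizers take the current parameters and gradients and return updated parameters, so swapping one for the other is a one-line change.

```python
class GradientDescent:
    """Plain gradient descent: step against the gradient."""
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate

    def step(self, params, grads):
        return [p - self.learning_rate * g for p, g in zip(params, grads)]


class Momentum:
    """Gradient descent with momentum: keeps a running velocity per parameter."""
    def __init__(self, learning_rate=0.1, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None

    def step(self, params, grads):
        if self.velocity is None:
            self.velocity = [0.0] * len(params)
        self.velocity = [self.momentum * v - self.learning_rate * g
                         for v, g in zip(self.velocity, grads)]
        return [p + v for p, v in zip(params, self.velocity)]


def minimize(optimizer, params, grad_fn, steps=500):
    """The same loop works for any object with a step(params, grads) method."""
    for _ in range(steps):
        params = optimizer.step(params, grad_fn(params))
    return params


def grad_fn(params):
    """Gradient of a toy loss (p0 - 3)^2 + (p1 + 1)^2, lowest at (3, -1)."""
    return [2.0 * (params[0] - 3.0), 2.0 * (params[1] + 1.0)]


result_gd = minimize(GradientDescent(learning_rate=0.1), [0.0, 0.0], grad_fn)
result_momentum = minimize(Momentum(learning_rate=0.1, momentum=0.9), [0.0, 0.0], grad_fn)
print(result_gd, result_momentum)
```

Both runs should end up very close to (3, -1). The learning rate and momentum values are hyperparameters we picked for this toy problem; other problems will want other values.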
All together, a very modular system
The machine learning process is very similar to the scientific thinking process of learning by example. But instead of using human sensory information to observe outcomes, machines must be fed data. By tweaking the process a bit, we can exploit the advantages of computers to get a much more powerful machine learning process. First we factor out the evaluation of a loss function from the updating of the model. This lets us be more systematic about how we define the feedback or learning signal for the model, and it gives us a consistent interface to try various loss functions. Finally, we abstract out the process of optimization. Given loss values, an optimizer will try to update the model’s parameters to produce better (lower) loss values. These are the fundamental parts of the machine learning process.
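To see the parts working together, here is a small end-to-end sketch in plain Python: a model (a line with two parameters), a loss (mean squared error), and an optimizer (gradient descent with hand-derived gradients). The function names, the toy data, and the hyperparameter values are all assumptions made for this illustration; a tool like TensorFlow would derive the gradients for us.

```python
def predict(params, x):
    """The model: a line with slope w and intercept b."""
    w, b = params
    return w * x + b

def loss_and_grads(params, examples):
    """Evaluate: mean squared error, plus its gradient for each parameter."""
    n = len(examples)
    loss = grad_w = grad_b = 0.0
    for x, y in examples:
        error = predict(params, x) - y
        loss += error ** 2 / n
        grad_w += 2.0 * error * x / n
        grad_b += 2.0 * error / n
    return loss, (grad_w, grad_b)

def update(params, grads, learning_rate):
    """Optimize: one step of gradient descent."""
    return tuple(p - learning_rate * g for p, g in zip(params, grads))

def train(examples, learning_rate=0.05, max_steps=5000, tolerance=1e-12):
    """Repeat evaluate + update until the loss is small enough."""
    params = (0.0, 0.0)
    for _ in range(max_steps):
        loss, grads = loss_and_grads(params, examples)
        if loss < tolerance:
            break  # converged: the loss is as low as we asked for
        params = update(params, grads, learning_rate)
    return params, loss

# Examples generated from the rule y = 2x + 1, which the model never sees directly.
examples = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
params, final_loss = train(examples)
print(params, final_loss)
```

The learned parameters should land very close to a slope of 2 and an intercept of 1. Notice that `train` never mentions what the loss measures or how the update is computed, so either piece can be swapped out without touching the loop.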
Let’s review. To quickly check your comprehension, think of or write down your own answer to each question before reading ours.
- What is the main difference between the scientific thinking process and the machine learning process?
Humans use sensory input to observe outcomes, but machines need to be fed data explicitly.
- What are some reasons to use TensorFlow?
It provides constraints that guide how you think about implementing your solution, along with features like symbolic differentiation, the ability to compile to CPU or GPU targets, and tools to visualize the model you have defined in the TF graph.
- What is the best optimizer?
There is no best optimizer for all situations.
Machine learning terms
- Task:
the specific challenge we are trying to solve using machine learning. You should define a task similarly to how you would write good comments or documentation: what is the input? what is the output? how will you know when the prediction is good enough?
- Model:
a set of symbolic expressions + numerical parameters. By updating parameters in response to input examples, the model “learns” to compress information relevant to solving the task.
- Encoding:
the process of converting input examples into the model’s internal representation. Typically refers to neural networks rather than traditional machine learning models.
- Feature engineering:
similar to encoding, but typically refers to a manual process where the programmer uses domain knowledge to encode features she knows to be useful for the task.
- Inference:
the process of using a trained model to make predictions.
- Decoding:
similar to inference, but typically used in context with “encoding”.
- Loss function:
a method for comparing a predicted result versus a known result during training; outputs a loss value. The magnitude of the loss value indicates the amount of error for the given inputs.