Takeaways and favorite papers from ICLR
Last week, three members of indico’s Advanced Development team attended the International Conference on Learning Representations (ICLR). ICLR focuses mainly on representation learning: working with raw data to build better features for solving complex problems. This covers ideas such as deep learning, kernel learning, compositional models, and non-convex optimization. Of all the conferences out there, the topics presented at ICLR align most closely with what we do here at indico. In this post I will go over some trends I noticed, as well as a few favorite papers.
In machine learning, one is always trying to optimize some kind of loss function. Traditionally, this is something like cross entropy or mean squared error. Sadly though, for some use cases, this is not the loss you want your final model to be good at. Take anything in the space of image generation, for example. To demonstrate a common error case, consider a model trying to generate a sharp edge. If a model is uncertain as to where exactly this edge should be, it will try to minimize the loss function to the best of its ability, instead of making a guess like a human would do. If this loss is mean squared error (or really any loss in pixel space) the model will output a blurry line — effectively averaging among all possible predictions. This blurry prediction has a lower loss value than a sharp random guess.
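To make the blurry-edge argument concrete, here is a minimal NumPy sketch. The 8-pixel 1D "image", the four equally likely edge positions, and the helper names `edge_image` and `expected_mse` are all assumptions made up for illustration.

```python
import numpy as np

def edge_image(position, width=8):
    """1D 'image': 0 to the left of the edge, 1 from the edge onward."""
    return (np.arange(width) >= position).astype(float)

def expected_mse(prediction, targets):
    """Mean squared error averaged over all equally likely targets."""
    return float(np.mean([(prediction - t) ** 2 for t in targets]))

# Assumption: the true edge is equally likely to sit at any of these positions.
targets = [edge_image(p) for p in (2, 3, 4, 5)]

blurry = np.mean(targets, axis=0)  # pixel-wise average of every possibility
sharp = edge_image(2)              # one plausible sharp guess

blurry_loss = expected_mse(blurry, targets)
sharp_loss = expected_mse(sharp, targets)
print(f"blurry: {blurry_loss:.4f}  sharp: {sharp_loss:.4f}")
```

The averaged prediction is a soft ramp rather than a step, yet it scores a lower expected error than any single sharp edge, which is exactly the behavior we would like to avoid.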
This is not what we want in a generative model. Ideally, we want to fool a human looking at the image. Sadly, human decisions can’t simply be plunked into a neural network, so we need some kind of proxy — preferably one that we can take the gradient of. The idea behind adversarial training is to turn this proxy into another neural network so that you essentially have an entire neural network as a loss function. The question then becomes how to train and work with these “adversarial” neural networks.
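As a rough sketch of what "a neural network as a loss function" means, the toy below scores generated samples with a discriminator. Everything here is an assumption for illustration: the discriminator is a one-parameter logistic model rather than a deep network, its weights are hand-picked instead of trained, and the generator loss uses the common "non-saturating" form.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator(x, w, b):
    """Probability that each sample in x is real (a logistic model here)."""
    return sigmoid(w * x + b)

def discriminator_loss(real, fake, w, b, eps=1e-12):
    """Binary cross entropy: real samples are labeled 1, generated ones 0."""
    real_term = -np.mean(np.log(discriminator(real, w, b) + eps))
    fake_term = -np.mean(np.log(1.0 - discriminator(fake, w, b) + eps))
    return float(real_term + fake_term)

def generator_loss(fake, w, b, eps=1e-12):
    """The generator is scored only by how real the discriminator thinks it is."""
    return float(-np.mean(np.log(discriminator(fake, w, b) + eps)))

rng = np.random.default_rng(0)
real = rng.normal(3.0, 1.0, size=1000)        # "real" data centered at 3
w, b = 2.0, -3.0                              # hand-picked, not trained
fake_far = rng.normal(0.0, 1.0, size=1000)    # generator far from the data
fake_close = rng.normal(3.0, 1.0, size=1000)  # generator matching the data

far_loss = generator_loss(fake_far, w, b)
close_loss = generator_loss(fake_close, w, b)
```

In real adversarial training both networks are updated in alternation: the discriminator takes gradient steps on `discriminator_loss`, and the generator takes gradient steps on `generator_loss` through the discriminator.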
Currently, this technique is used quite a bit for general image modeling. Deep Directed Generative Models with Energy-Based Probability Estimation by Taesup Kim and Yoshua Bengio used this concept with energy-based models (think probabilistic models without any normalization) to get around their otherwise intractable sampling. It was also applied, and shown to work quite well, for “next frame” video prediction: Multi-Scale Video Prediction Beyond Mean Squared Error by Michael Mathieu, Camille Couprie and Yann LeCun.
Additionally, these adversarial loss functions let us do things that would not be possible (or at least not easy) with traditional methods. These include training autoencoders whose code space is any sample-able distribution: Adversarial Autoencoders by Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. They demonstrate their autoencoders by encoding to a 2D uniform distribution for easy visualization of high dimensional data.
We can also perform tasks like removing culturally sensitive information from our model’s predictions. Say, for example, you want to remove the notion of race when making predictions. This can be done by training an adversary to predict this information from the model’s internal representation, and optimizing the model so that the adversary performs poorly. More information can be found in Censoring Representations with an Adversary by Harrison Edwards and Amos Storkey.
These methods are not without their faults, however. First and foremost in my mind is that there is no easy way to evaluate them. In fact, existing methods of evaluation are broken in a number of ways. On a few metrics it is actually easy to score better than a held-out test set using something as simple as k-means. A Note on the Evaluation of Generative Models by Lucas Theis, Aäron van den Oord, and Matthias Bethge argues that the only reliable way to evaluate these models is to measure performance with respect to the target use case.
I feel as if these papers are just the beginning of a big shift in how we train and think about these models. I am looking forward to NIPS in the hopes that more cool techniques will be released!
Another trend I noticed was work on improving optimization. In my opinion, the field still doesn’t really have a good idea of how to optimize these deep models, specifically the deep recurrent ones with all the fancy new attention techniques. There was a large body of work moving this understanding forward. Some of my favorites, in no particular order, are below:
Adding Gradient Noise Improves Learning For Very Deep Networks by Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, James Martens. In this paper, they take traditionally complex models, like Neural Turing Machines and Memory Networks, and use the simple trick of adding gradient noise to drastically improve convergence.
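The gradient noise trick is small enough to sketch in a few lines. This is a toy illustration, not the authors' code: the decaying variance schedule `eta / (1 + step) ** gamma` and the default values are my recollection of the paper and should be treated as assumptions, and the quadratic loss stands in for a real model.

```python
import numpy as np

def noisy_gradient(grad, step, rng, eta=0.01, gamma=0.55):
    """Add zero-mean Gaussian noise whose variance decays as training proceeds."""
    variance = eta / (1.0 + step) ** gamma
    return grad + rng.normal(0.0, np.sqrt(variance), size=grad.shape)

rng = np.random.default_rng(0)
w = np.array([5.0, -3.0])  # toy parameters
lr = 0.1

for step in range(500):
    grad = 2.0 * w  # gradient of the toy loss sum(w ** 2)
    w -= lr * noisy_gradient(grad, step, rng)
```

Because the noise is annealed toward zero, it perturbs the early updates the most and interferes less and less as training settles down.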
Regularizing RNNs by Stabilizing Activations by David Krueger, Roland Memisevic. They propose a regularization cost that tries to keep the norm of the activations the same, arguing that at test time the activations in an RNN can grow so large (becoming unstable) that any new input will not be able to change the output. By adding this cost, they ensure that this problem will not happen and thus allow for more stable predictions.
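A penalty of this kind might look like the sketch below. It assumes the cost is the mean squared difference between the L2 norms of consecutive hidden states, scaled by a weight `beta`; the function name and the array layout are hypothetical.

```python
import numpy as np

def norm_stabilizer_cost(hidden_states, beta=1.0):
    """Penalize changes in the hidden state's L2 norm between time steps.

    hidden_states has shape (time, hidden_size).
    """
    norms = np.linalg.norm(hidden_states, axis=1)
    return float(beta * np.mean((norms[1:] - norms[:-1]) ** 2))

# Hidden states whose norm never changes incur no penalty.
stable = np.ones((5, 4))
# Hidden states that keep growing are penalized.
growing = np.ones((5, 4)) * np.arange(1, 6)[:, None]

stable_cost = norm_stabilizer_cost(stable)
growing_cost = norm_stabilizer_cost(growing)
```

In practice this term would be added to the usual training loss, so the network is only nudged toward stable norms rather than forced to keep them.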
GradNets: Dynamic Interpolation Between Neural Architectures by Diogo Almeida and Nate Sauder. In this paper, they tackle difficult optimization tasks by interpolating between different neural network designs over the course of training. They interpolate between things like a linear activation and ReLU, or between a plain linear layer and a batch-normalized one. Somehow they obtain better results than either of the two networks when trained to convergence!
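For the linear-to-ReLU case, the interpolation can be sketched as a weighted blend of the two activations. The linear ramp used for the blend weight `g` and both function names are assumptions for illustration, not necessarily the schedule used in the paper.

```python
import numpy as np

def interpolated_activation(x, g):
    """Blend the identity with ReLU: g=0 is purely linear, g=1 is purely ReLU."""
    return g * np.maximum(x, 0.0) + (1.0 - g) * x

def interpolation_weight(step, anneal_steps):
    """Linearly ramp g from 0 to 1 over the first anneal_steps updates."""
    return min(step / anneal_steps, 1.0)

x = np.array([-2.0, 3.0])
start = interpolated_activation(x, interpolation_weight(0, 100))    # linear
middle = interpolated_activation(x, interpolation_weight(50, 100))  # half way
end = interpolated_activation(x, interpolation_weight(200, 100))    # ReLU
```

Early in training the network behaves like an easy-to-optimize linear model, and it gradually turns into the nonlinear one.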
There are so many more interesting papers that came out of this conference and I’m just scratching the surface of them, both in the conference and workshop tracks! Personally, I’m going to go back through the exciting work done in reinforcement learning and on Gaussian processes. Luckily for everyone, all of these papers are freely accessible and indexed.
Lastly, if you’re interested in generative modeling, check out indico’s submission: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks by Alec Radford, Luke Metz, and Soumith Chintala (Facebook AI). If you have any questions about machine learning that you’d like us to cover on our blog, feel free to reach out to us via email at firstname.lastname@example.org – or through the little chat bubble in the bottom right-hand corner of your screen!