This spring I presented a talk entitled “Effective Transfer Learning for NLP” at ODSC East.
The talk was intended to demonstrate how surprisingly effective pre-trained word and document embeddings are at low training data volumes, and to lay out a set of practical recommendations for applying these techniques to your own tasks. Thanks to some excellent research by Alec Radford and the team at OpenAI, our recommendations are beginning to change.
To explain why the tides are shifting, let’s first walk through the rubric we use at Indico to evaluate whether or not a novel machine learning method is viable for industry use.
Evaluating Novel Machine Learning Methods
For broad practical application, a machine learning model must check off most of the following requirements:
- Quick to train
- Quick to predict
- Require minimal or no hyperparameter tuning
- Function well at the lower extremes of training data availability (100s of examples)
- Applicable across a broad range of tasks + domains
- Scale well with labeled training data
Let’s see how well pre-trained word + document embeddings satisfy these requirements:
- Quick to train: Training lightweight models on top of pre-trained embeddings takes only seconds, although computing the pre-trained embeddings themselves depends on the complexity of the base model.
- Quick to predict: Prediction is similarly fast. Inference is only as expensive as the base model.
- Require minimal or no hyperparameter tuning: Cross-validation of regularization parameters and embedding type helps, but this operation is cheap enough that it's not problematic.
- Function well at the lower extremes of training data availability (100s of examples): Applying a logistic regression block on top of a pre-trained word embedding only requires learning 100s to low 1000s of parameters rather than millions, and the simplicity of the model means it requires very little data to achieve good results.
- Applicable across a broad range of tasks + domains: Pre-trained word + document embeddings are generally “good enough”, but domain and task alignment of the source + target models matters a fair amount.
- Scale well with labeled training data: This approach caps out rather quickly and doesn't benefit much from additional training data. Learning linear models helps produce better results with less data, but means that the capacity of the model to learn complex associations between inputs and outputs is much lower.
In short, using pre-trained embeddings is computationally cheap and performs well at the lower extremes of training data availability, but using static representations imposes an unfortunate cap on the benefit gained from additional training data. Getting good performance out of pre-trained embeddings requires searching for the right pre-trained embeddings for a given task, but it’s difficult to predict whether a pre-trained representation will generalize well to a new target task. If no feature representation is a good match, you’re simply out of luck.
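To make that baseline concrete, here's a minimal sketch of the recipe described above: average pre-trained word vectors for each document, then train a cross-validated logistic regression on top. The `glove` lookup (a word-to-vector dictionary, e.g. parsed from a public GloVe file) and the `train_texts` / `train_labels` / `test_texts` variables are placeholders for your own data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def embed(texts, word_vectors, dim=300):
    # Represent each document as the mean of its pre-trained word vectors.
    features = []
    for text in texts:
        vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
        features.append(np.mean(vectors, axis=0) if vectors else np.zeros(dim))
    return np.array(features)

X_train = embed(train_texts, glove)
X_test = embed(test_texts, glove)

# A lightweight linear model on top of frozen features: only a few hundred
# parameters are learned, and regularization strength is cross-validated.
classifier = LogisticRegressionCV(Cs=10, max_iter=1000)
classifier.fit(X_train, train_labels)
predictions = classifier.predict(X_test)
```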
Computer Vision’s Transfer Learning Solution
Thankfully, research from the computer vision domain offers a viable alternative. In the computer vision community, using pre-trained feature representations has been largely superseded by methods that “finetune” pre-trained models: rather than reinitializing and learning only the weights of a final classification layer, all of the weights of the source model are updated on the target task. As training data availability increases, this additional model flexibility starts to pay dividends.
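In code, the difference between the two regimes looks roughly like the following PyTorch sketch. The ResNet-18 base model and the `num_classes` argument are illustrative stand-ins, not anything prescribed by the papers discussed here.

```python
import torch.nn as nn
from torchvision import models

def build_model(num_classes, finetune=True):
    # Start from a network whose weights were pre-trained on ImageNet.
    model = models.resnet18(pretrained=True)
    if not finetune:
        # Feature-extraction baseline: freeze every pre-trained weight and
        # learn only the new classification layer added below.
        for param in model.parameters():
            param.requires_grad = False
    # Replace the final layer with one sized for the target task. A finetuned
    # model updates all weights, typically with a small learning rate.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```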
The basics of this approach are several years old: Yosinski, Clune, Bengio, and Lipson explored the transferability of convolutional neural network (CNN) parameters in 2014, but it wasn't until more recently that this process became common practice. CNN finetuning is now commonplace enough that Stanford's undergraduate computer vision course (CS231N) teaches the process as part of its curriculum, and a 2018 paper by Mahajan et al. (“Exploring the Limits of Weakly Supervised Pretraining”) suggests that finetuning should *always* be used in place of pre-trained features when model performance is of utmost importance.
Model Finetuning for Natural Language Processing
So why has the natural language processing community lagged so far behind? In his recent blog post, “NLP’s ImageNet moment has arrived”, Sebastian Ruder contends that the reason is the lack of an established dataset and source task for learning generalizable base models. Until recently, the natural language processing community lacked its ImageNet equivalent: a standardized dataset and training objective to use for training base models.
However, recent papers like Howard and Ruder’s “Universal Language Model Fine-tuning for Text Classification” and Radford’s paper “Improving Language Understanding by Generative Pre-Training” have demonstrated that model finetuning is finally showing promise in the natural language domain. Although the source dataset varies across these papers, the community seems to be standardizing on a “language modeling” objective as the go-to for training transferable base models.
Language modeling, simply put, is the task of predicting the next word in a sequence. Given the partial sentence “I thought I would arrive on time, but ended up 5 minutes ____”, it’s reasonably obvious to the reader that the next word will be a synonym of “late”. Effectively solving this task requires not only an understanding of linguistic structure (nouns follow adjectives, verbs have subjects and objects, etc.) but also the ability to make decisions based on broad contextual clues (“late” is a sensible option for filling in the blank in our example because the preceding text provides a clue that the speaker is talking about time.) In addition, language modeling has the desirable property of not requiring labeled training data. Raw text is abundantly available for every conceivable domain. These two properties make language modeling an ideal fit for learning generalizable base models.
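Because the “labels” are just the input text shifted by one position, no annotation is required at all. A minimal PyTorch sketch of the objective, assuming a `model` that maps a batch of token ids to logits over the vocabulary, might look like this:

```python
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    # Each position is asked to predict the token that follows it, so the
    # targets are simply the inputs shifted one step to the left.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # shape: (batch, sequence_length - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten out the positions
        targets.reshape(-1),
    )
```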
The language modeling objective wasn’t the only advance necessary to make model finetuning for NLP viable, however. Using “Transformer” models in place of the typical recurrent models (LSTMs) has also played a major role. In “Improving Language Understanding by Generative Pre-Training”, we see a notable delta between the performance of a finetuned transformer model and a finetuned recurrent (LSTM) model. LSTMs are no longer the de facto standard for sequence modeling; non-recurrent models have shown competitive performance on a wide range of tasks.
John Miller dives into this recent trend in his blog post, “When Recurrent Models Don’t Need to Be Recurrent”, suggesting that the infinite memory LSTMs have in theory may not exist in practice. In addition, a fixed context window may be more than sufficient to solve tasks like language modeling, and the residual block structure of the Transformer model seems to lend itself well to transfer learning applications. In short, the theoretical disadvantages of Transformer models are outweighed by their practical advantages, like faster training and inference times.
Does Model Finetuning Meet Our Criteria?
In light of this recent advance, let’s fill out our rubric again to see how well model finetuning meets our requirements.
- Quick to train: Although expensive compared to pre-computed feature representations, OpenAI’s transformer model can be finetuned on a few hundred examples in ~5 minutes with modern GPU hardware.
- Quick to predict: Inference is also notably more expensive; throughput is limited to single digits of documents per second. Improvements to inference speed are necessary before widespread application is practical.
- Require minimal or no hyperparameter tuning: The default hyperparameters work surprisingly well across tasks, although basic cross-validation to search for ideal regularization parameters is beneficial.
- Function well at the lower extremes of training data availability (100s of examples): Model finetuning performs as well as using pre-trained embeddings at data volumes as low as 100 examples.
- Applicable across a broad range of tasks + domains: Domain + task alignment seems to matter less than it does with pre-trained feature representations. The language modeling objective seems to learn features that are applicable to both semantic and syntactic tasks.
- Scale well with labeled training data: Using a dramatically more complex model means that tasks that are simply not solvable using pre-trained feature representations become possible with sufficient training data. The gap between pre-trained features and model finetuning widens substantially as more training data becomes available. In fact, finetuning often seems preferable to training from scratch: OpenAI’s paper “Improving Language Understanding by Generative Pre-Training” reported new state-of-the-art results on 9 of the 12 datasets evaluated.
Although it has its limitations, model finetuning for NLP tasks has significant promise and has already shown clear advantages over the current best practice of working with pre-trained word and document embeddings.
Sebastian Ruder sums up the situation nicely:
“The time is ripe for practical transfer learning to make inroads into NLP. In light of the impressive empirical results of ELMo, ULMFiT, and OpenAI it only seems to be a question of time until pretrained word embeddings will be dethroned and replaced by pretrained language models in the toolbox of every NLP practitioner. This will likely open many new applications for NLP in settings with limited amounts of labeled data. The king is dead, long live the king!”
A Quantitative Look at Model Finetuning for NLP
Our early benchmarks confirm that there’s a real and generic benefit to model finetuning over the use of pre-trained representations. Below is a sample of the output from a recent benchmark obtained using Enso, our transfer learning benchmarking tool.
Each point represents the mean + 95% confidence interval from 5 trials on a random subset of the full dataset. The X axis shows the number of labeled training examples available, and the Y axes show mean ROC AUC and accuracy, respectively. Training datasets were oversampled before fitting. These results are the aggregation of fitting ~2500 models.
The finetuned model is an implementation of OpenAI’s transformer language model. The baseline model is a cross-validated logistic regression trained on GloVe word embeddings. Although stronger baselines exist for comparison, the mean of GloVe embeddings is a surprisingly strong baseline for most classification tasks at these training data volumes. We hope to publish comparisons against other methods in future benchmarks, but it’s still striking that model finetuning outperforms the naive baseline with as few as 100 labeled training examples.
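The full benchmarking harness lives in Enso, but the protocol itself is easy to outline. The loop below is an illustrative sketch (not Enso’s actual API): `make_model`, `train_texts` / `train_labels`, and the held-out `test_texts` / `test_labels` are hypothetical placeholders, a binary task is assumed, and the class-balancing oversampling step and accuracy metric are omitted for brevity.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

results = {}
for n_train in [50, 100, 200, 500, 1000]:
    trial_scores = []
    for trial in range(5):
        # Draw a random labeled subset of the requested size.
        idx = np.random.choice(len(train_texts), size=n_train, replace=False)
        subset_texts = [train_texts[i] for i in idx]
        subset_labels = [train_labels[i] for i in idx]

        # `make_model` returns either the GloVe + logistic regression baseline
        # or the finetuned transformer, so both follow the same protocol.
        model = make_model()
        model.fit(subset_texts, subset_labels)

        # Score on a fixed held-out test set.
        probabilities = model.predict_proba(test_texts)
        trial_scores.append(roc_auc_score(test_labels, probabilities[:, 1]))

    # Record the mean and spread across the 5 trials for this training size.
    results[n_train] = (np.mean(trial_scores), np.std(trial_scores))
```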
Complete benchmarks on 23 different classification tasks are available for download on S3.
“Finetune”: Scikit-Learn Style Model Finetuning for NLP
In light of this recent development, Indico has open sourced a wrapper for OpenAI’s work on transformer model finetuning, appropriately named “finetune”. It’s our attempt to make the research done by Radford more widely accessible by packaging it in an easy-to-use, scikit-learn-style library.
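As a preview, the interface we’re aiming for looks roughly like the snippet below; consult the finetune documentation for the exact, current API. The `train_texts`, `train_labels`, and `test_texts` variables stand in for your own labeled data.

```python
from finetune import Classifier

model = Classifier()                     # load the pre-trained transformer language model
model.fit(train_texts, train_labels)     # finetune all of the model's weights on your labeled examples
predictions = model.predict(test_texts)  # predicted class for each document
model.save("my_classifier")              # serialize the finetuned model for later inference
```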
In part two of this blog post series, we’ll give a practical introduction to using finetune to see improvements on your own tasks with a few short lines of code. In the meantime, feel free to check out finetune’s documentation to get started on your own!