Folks from fast.ai are already quite famous for their cutting-edge deep learning courses, and they are just getting started. Yesterday they published their research on pre-training language models for all kinds of NLP problems. And this research is awesome. Just take a look at these results:
The blue line represents a fresh model trained only for this task, the orange one represents a pre-trained language model that was fine-tuned for the task, and the green one represents a pre-trained model that was first fine-tuned as an LM on this task's dataset and then fine-tuned for its objective. The last option achieves better results than the first with 100 times less data!
Basically, their approach lets you use pre-trained LMs and get much better results with much less data. It's pretty much what ResNets pre-trained on ImageNet did for Computer Vision. And they've published all the source code along with an LM pre-trained on WikiText-103 (103 million words), so feel free to use it in your research/projects.
How it works
In short, the recipe looks like this (a rough code sketch follows the list):
- Train an LM on a huge dataset or download a pre-trained one.
- Fine-tune this LM on your data.
- Add a few layers and fine-tune it to solve the task at hand.
- Well done! You've probably just achieved a SOTA result. Now you can pick another problem and return to step 2.
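For reference, here is roughly what this recipe looks like in code with the fastai library. This is a minimal sketch rather than their exact training script: API names differ between fastai versions, and the CSV file and its text/label columns are made-up placeholders.

```python
from fastai.text.all import *
import pandas as pd

# Hypothetical dataset: a CSV with 'text' and 'label' columns.
df = pd.read_csv('my_dataset.csv')

# Step 2: fine-tune the WikiText-103 pre-trained LM on our own texts.
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)
lm_learn.fine_tune(3, 1e-2)
lm_learn.save_encoder('ft_encoder')

# Step 3: put a classifier head on top of the fine-tuned encoder
# and train it for the actual task.
dls_clas = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                   text_vocab=dls_lm.vocab)
clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5)
clas_learn.load_encoder('ft_encoder')
clas_learn.fine_tune(4, 1e-2)
```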
Now let's take a look at each step more closely.
- Just a 3-layer LSTM with carefully tuned dropout parameters, as described in this paper (AWD-LSTM), trained on WikiText-103 (28,595 preprocessed Wikipedia articles, 103 million words).
- Fine-tuning the language model from the previous step on our data. In general this creates a problem: the model forgets what it has previously learned. To address it, the authors propose two techniques: Discriminative Fine-tuning (reduce the learning rate for each earlier layer, from last to first, by some factor, 2.6 in this particular case) and Slanted Triangular Learning Rates (increase the LR linearly for the first ~10% of iterations, then decrease it linearly). See the sketch after this list.
- Finally, add some fully-connected layers and train the model for the task. To avoid catastrophic forgetting at this step, the authors propose Gradual Unfreezing (freeze all pre-trained weights first, then unfreeze one layer per epoch, from last to first; also sketched below). Also, for classification tasks they divide large documents into batches and initialize the model with the hidden state of the previous batch.
- They have already reported SOTA results for six tasks, with more on the way.
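To make the fine-tuning tricks above more concrete, here is a small PyTorch sketch of Discriminative Fine-tuning, Slanted Triangular Learning Rates, and Gradual Unfreezing. It uses a toy model with made-up layer sizes rather than the actual AWD-LSTM, but the scheduling logic follows the paper's description.

```python
import torch
from torch import nn, optim

# Toy stand-in for the pre-trained encoder plus a new head; in ULMFiT the
# layer groups would be the embedding, the three LSTM layers, and the head.
layers = [nn.Linear(400, 400), nn.Linear(400, 400),
          nn.Linear(400, 400), nn.Linear(400, 2)]
model = nn.Sequential(*layers)

# Discriminative Fine-tuning: each earlier layer group gets the learning
# rate of the group above it divided by 2.6.
base_lr = 1e-3
param_groups = [{'params': layer.parameters(), 'lr': base_lr / 2.6 ** depth}
                for depth, layer in enumerate(reversed(layers))]
optimizer = optim.SGD(param_groups, lr=base_lr)

# Slanted Triangular Learning Rates: increase the LR linearly for the first
# ~10% of iterations, then decrease it linearly for the rest.
def stlr_scale(step, total_steps, warmup_frac=0.1, ratio=32):
    cut = int(total_steps * warmup_frac)
    p = step / cut if step < cut else max(0.0, 1 - (step - cut) / (total_steps - cut))
    return (1 + p * (ratio - 1)) / ratio

total_steps = 1000  # total number of batches over all epochs (made-up number)
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: stlr_scale(step, total_steps))

# Gradual Unfreezing: start with everything frozen, then unfreeze one more
# layer group per epoch, from the last layer to the first.
for p in model.parameters():
    p.requires_grad = False
for epoch in range(len(layers)):
    for layer in layers[len(layers) - 1 - epoch:]:
        for p in layer.parameters():
            p.requires_grad = True
    # ... one training epoch goes here, calling optimizer.step() and
    # scheduler.step() after every batch ...
```

In the fastai library these tricks are built in (for example, discriminative learning rates via `slice(...)` and unfreezing via `freeze_to(...)`), so in practice you rarely write this by hand.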
This is something truly awesome. Classification in NLP has been quite hard compared to CV, but now we can train a good model with just a few hundred examples. And this approach works not just for classification, but for almost any kind of NLP problem. I guess this research will have as much impact as word vectors had a few years ago. I'm really excited to see how many tasks can finally be solved with it.