Folks from fast.ai are already quite famous for their cutting-edge deep learning courses, and they are just getting started. Yesterday they published their research on pre-training language models for all kinds of NLP problems. And this research is awesome. Just take a look at these results:

Taken from the paper.

The blue line represents a fresh model trained only for this task, the orange line a pre-trained language model fine-tuned for the task, and the green line a pre-trained model first fine-tuned as a language model on this task's dataset and then fine-tuned for its objective. The last option achieves better results than the first with 100 times less data!

Basically, their approach lets you take a pre-trained LM and get much better results with much less data. It's pretty much what ResNets pre-trained on ImageNet did for Computer Vision. And they've published all the source code along with a LM pre-trained on WikiText-103 (103 million words), so feel free to use it in your research or projects.

How it works

In short, the recipe looks like this:

  1. Train a LM on a huge dataset, or download a pre-trained one.
  2. Fine-tune this LM on your data.
  3. Add a few layers and fine-tune it to solve the task at hand.
  4. Well done! You’ve probably just achieved a SOTA result. Now you can pick another problem and return to step 2.
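The recipe above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the real model is a 3-layer AWD-LSTM, while the tiny layer sizes and class names here are made up for clarity. The key idea it shows is that the LM head and the task head share one backbone.

```python
# Minimal sketch of the pre-train -> fine-tune -> classify recipe.
# All sizes and names below are illustrative, not from the paper.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128

class LMBackbone(nn.Module):
    """The part that gets pre-trained and reused (a single tiny LSTM here)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(self.emb(x))
        return out                       # (batch, seq, HID)

class LMHead(nn.Module):
    """Steps 1-2: decoder over the vocabulary for language modeling."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.decoder = nn.Linear(HID, VOCAB)

    def forward(self, x):
        return self.decoder(self.backbone(x))

class Classifier(nn.Module):
    """Step 3: keep the backbone, swap the decoder for task-specific layers."""
    def __init__(self, backbone, n_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(HID, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, x):
        out = self.backbone(x)
        return self.head(out[:, -1])     # classify from the last time step

backbone = LMBackbone()
lm = LMHead(backbone)                    # fine-tune this on your corpus first...
clf = Classifier(backbone, n_classes=2)  # ...then train this on the labeled task
logits = clf(torch.randint(0, VOCAB, (4, 12)))
print(logits.shape)                      # torch.Size([4, 2])
```

Because `clf` holds the same `backbone` object, everything the LM learned carries over to the classifier for free.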

Now let’s take a look at each step more closely.

  1. The LM is just a 3-layer LSTM with carefully tuned dropout parameters, as described in the AWD-LSTM paper, trained on WikiText-103 (28,595 preprocessed Wikipedia articles, 103 million words).
  2. Fine-tune the language model from the previous step on your data. In general, this creates a problem: the model forgets what it has previously learned. To solve it, the authors propose two techniques: Discriminative Fine-tuning (reduce the learning rate for each earlier layer, from last to first, by some factor, 2.6 in this particular case) and Slanted Triangular Learning Rates (increase the LR linearly for the first ~10% of iterations, then decrease it linearly).
  3. Finally, add some fully-connected layers and train the model for the task. To avoid catastrophic forgetting at this step, the authors propose Gradual Unfreezing (freeze all pre-trained weights first, then after each epoch unfreeze one more layer, from last to first). Also, for classification tasks they divide large documents into batches and initialize the model with the hidden state of the previous batch.
  4. They have already reported SOTA results for six tasks, with more on the way.
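The three training tricks from steps 2 and 3 are easy to write down in plain Python. The STLR formula below follows the paper (warm-up over a `cut_frac` fraction of iterations, `ratio` between the maximum and minimum LR); the function names and the `lr_max` default are my own, illustrative choices.

```python
# Sketches of the fine-tuning tricks: STLR, discriminative LRs,
# and gradual unfreezing. Function names here are illustrative.

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted Triangular LR: linear warm-up for the first cut_frac
    of the T iterations, then a long linear decay down to lr_max/ratio."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(lr_top, n_layers, factor=2.6):
    """Discriminative fine-tuning: each layer below the top gets its
    learning rate divided by `factor` (2.6 in the paper) once more."""
    return [lr_top / factor ** (n_layers - 1 - i) for i in range(n_layers)]

def unfrozen_layers(n_layers, epoch):
    """Gradual unfreezing: at epoch 0 only the last layer trains; each
    epoch one more layer, counting from the top, becomes trainable."""
    return list(range(max(0, n_layers - 1 - epoch), n_layers))

# LR peaks exactly at the end of the warm-up (10% of 100 iterations):
print(slanted_triangular_lr(10, 100))   # 0.01
print(discriminative_lrs(0.01, 3))      # top layer keeps 0.01
print(unfrozen_layers(4, 1))            # [2, 3]
```

In a real PyTorch setup, `discriminative_lrs` would feed the per-layer `lr` values of optimizer parameter groups, and `unfrozen_layers` would decide which parameters get `requires_grad = True` at each epoch.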

This is something truly awesome. Classification in NLP has been quite hard relative to CV, but now we can train a good model with just a few hundred examples. And this approach works not only for classification but for almost any kind of NLP problem. I guess this research will have as much impact as word vectors did a few years ago. I’m really excited to see how many tasks can finally be solved with it.