Encoder-decoders in Transformers: a hybrid pre-trained architecture for seq2seq

How to use them with a sneak peek into upcoming features 🕵️‍♀️

Our Transformers library implements many (11 at the time of writing) state-of-the-art transformer models. It is used by researchers and practitioners alike to perform tasks such as text classification, named entity recognition, question answering, or text generation. Its API is compatible with both PyTorch and TensorFlow.

While many recent models have focused on single-stack architectures, encoder-decoders have come under the spotlight again recently, notably with Facebook’s BART and Google’s T5.

This post briefly goes through the (modern) history of transformers and the comeback of the encoder-decoder architecture. I will walk you through the implementation of encoder-decoders in the transformers library, show how you can use them for your projects, and give you a taste of what is coming in the next releases.

Hello 👾 Transformers

The transformer storm began with “Attention is all you need”, and the architecture proposed in the paper featured both an encoder and a decoder; it was originally aimed at translation, a Seq2Seq task. Its principal innovation compared to RNNs was to replace recurrence with stacked layers of attention, so every token can attend to every other token directly.

The original transformer architecture — that you have probably seen everywhere — has an encoder and decoder stack.

🚀 The rise of single-stack architectures

Following this, two papers came and further disrupted model architectures:

  1. GPT from OpenAI
  2. BERT from Google AI Language


The authors of GPT completely dropped the decoder of the original Transformer. They left us with this:

Our poor transformer cut in half.

The authors trained the model as a language model, i.e. to learn the probability distribution over possible sequences, in an unsupervised way. They did so by factorizing the distribution in a particular way:

P(x₁, …, xₙ) = ∏ₜ P(xₜ | x₁, …, xₜ₋₁)

This is mathematically trivially true: the probability of a sequence is the product of the probabilities of its tokens conditioned on the previous tokens. Note that this is not the only possible factorization, just one that turns out to be particularly useful.
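To make the factorization concrete, here is a toy sketch using a first-order (bigram) model for simplicity; the vocabulary and probabilities below are invented purely for the example:

```python
# Toy conditional probabilities P(token | previous token); the numbers
# are made up purely for illustration.
cond_prob = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.4,
    ("cat", "sat"): 0.6,
}

def sequence_prob(tokens):
    """P(sequence) as the product of next-token conditionals (chain rule)."""
    prob = 1.0
    prev = "<s>"  # start-of-sequence symbol
    for tok in tokens:
        prob *= cond_prob[(prev, tok)]
        prev = tok
    return prob

print(sequence_prob(["the", "cat", "sat"]))  # 0.5 * 0.4 * 0.6 ≈ 0.12
```

A full language model conditions each token on *all* previous tokens, not just the last one, but the product structure is exactly the same.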

However, encoders are stacks of self-attention layers in which every token can attend to every other token; at the top of the encoder, the probability of each token therefore depends on every other token. How can the model learn the language model above?

The authors used a trick: the attention mask. Given queries Q, keys K and values V, the output of a (single-headed) attention layer reads:

Attention(Q, K, V) = softmax(QKᵀ / √d_k + M) V

Attention mechanism with masking. The mask M specifies which positions the output can attend to, by forcing the corresponding softmax weights to 0 when a position cannot be attended to.
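Here is a minimal NumPy sketch of masked single-headed attention; shapes and names are chosen for the example, and the softmax is computed manually for clarity:

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Single-head attention: softmax(Q K^T / sqrt(d_k) + mask) V.

    Positions where `mask` is -inf receive zero weight after the softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + mask
    # Numerically stable softmax over the last axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Usage: a causal mask forbids attending to later positions.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
mask = np.triu(np.full((3, 3), -np.inf), k=1)
out, weights = masked_attention(Q, K, V, mask)
```

With this mask, the attention weights above the diagonal are exactly zero: each position only mixes information from itself and earlier positions.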

The idea is to add a matrix M that “forbids” tokens (say, words) from attending to one another. The following mask is used in GPT to prevent tokens from attending to tokens later in the sequence:

Left-to-right mask. For a given token in the sequence, we assign a mask value of 0 to this token and the preceding ones, and a value of minus infinity to the later ones. As a result, tokens can only attend to tokens that precede them in the sequence.
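The mask itself is one line of NumPy (a sketch, not the library's implementation):

```python
import numpy as np

def causal_mask(n):
    """Left-to-right mask: 0 on and below the diagonal (visible positions),
    -inf above it (future positions, zeroed out by the softmax)."""
    return np.triu(np.full((n, n), -np.inf), k=1)

# Each row i can attend only to positions 0..i.
mask = causal_mask(4)
```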

Using this mask you can train the model by making each token in the sequence predict the next one, and you can then generate new sequences auto-regressively. While its generation abilities are nothing short of amazing, natural language understanding (NLU) is not GPT’s strong suit. That is where BERT entered the stage and took the NLP world by storm.


BERT, unlike GPT, does not use any mask trick during pre-training. It is the pre-training task that does all the heavy lifting.

Instead of teaching the model to predict the next word in a sequence, BERT masks a fixed proportion of tokens at random and trains the model to recover these masked words (a Cloze test, used, among other things, to evaluate people’s proficiency in a foreign language). The pre-trained model can then be fine-tuned on many language understanding tasks such as named entity recognition, question answering and text classification. BERT thus achieved a qualitative jump on many NLU benchmarks.
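A simplified sketch of this masking procedure (BERT's actual recipe also sometimes substitutes a random token or keeps the original in place of `[MASK]`; the function below ignores those details):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: replace a random subset of tokens with
    [MASK]. The model is then trained to recover the originals."""
    rng = random.Random(seed)  # seeded for reproducibility in this example
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)   # the label the model must predict
        else:
            masked.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return masked, targets

sentence = "the cat sat on the mat and looked around quietly".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
```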

As the figure below shows, many of the papers that followed are iterations on the foundations laid by BERT and GPT:

  • A transformer encoder;
  • Various pre-training tasks and associated attention masks.
Not all models implement the encoder-decoder architecture; indeed, it is only now becoming popular again. Transformer-XL, GPT-2, XLNet and CTRL approximate a decoder stack during generation by reusing the hidden states of previous steps as the keys and values of the attention module. Side note: all of these ☝️ models are implemented in the transformers library or will be soon.

Yet not every task can be reduced to either pure text generation or pure NLU. Some tasks require both understanding and generation capabilities. For instance:

Me reaching the limits of my drawing skills.

In these situations, what we would like the model to learn is not only the probability of the generated sequence, but the probability of that sequence given another sequence:

P(y₁, …, yₘ | x₁, …, xₙ) = ∏ₜ P(yₜ | y₁, …, yₜ₋₁, x₁, …, xₙ)

Language model and Seq2Seq language models. Sometimes the distinction is pedantic, sometimes it’s not.

In a plot twist, the authors of XLM and UniLM managed to fit these two tasks in a single encoder. How? With a smart use of embeddings (XLM, for translation) or a clever mask trick (UniLM)!

The prefix mask as defined in the UniLM paper. Words in the first sequence can attend to any other word in that sequence; words in the second sequence can attend to every word in the first sequence, and only to the preceding words in their own sequence.
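The prefix mask can be sketched as follows (0 marks a visible position, minus infinity a hidden one; a toy construction, not UniLM's code):

```python
import numpy as np

def prefix_mask(len_a, len_b):
    """UniLM-style prefix mask for a concatenated pair (sequence A, sequence B).
    A-tokens attend to all of A; B-tokens attend to all of A plus the
    preceding B-tokens. 0 = visible, -inf = hidden from the softmax."""
    n = len_a + len_b
    mask = np.full((n, n), -np.inf)
    mask[:, :len_a] = 0.0             # everyone sees the full first sequence
    for i in range(len_a, n):         # B-tokens: causal within the second sequence
        mask[i, len_a:i + 1] = 0.0
    return mask

# A two-token source and two-token target:
mask = prefix_mask(2, 2)
```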

👋 The comeback of Encoder-decoder architectures

So why should we care about the encoder-decoder architecture if a single, smaller architecture does the job very well? Can it even do what the smaller architecture does?

The authors of the T5 paper recently answered the latter question in the affirmative; encoder-decoders even perform extremely well. Building on previous ideas, they proposed a scheme to map any natural language understanding task to a text-to-text task (read the paper if you have time, you won’t regret it).

To answer the first question, I would say that there is one thing that might be much easier to do with encoder-decoders: transfer learning on every task that can be mapped to a translation task.

(note: these are speculations)

Say you have a pre-trained model in language A, a pre-trained model in language B. You could theoretically use one as the encoder, the other as the decoder and fine-tune the model on a translation task.

This is not only true for natural language. Take the example of a data scientist bored of writing simple SQL queries whenever asked, and a boss who could not care less about using a frontend to answer their own questions. They could pre-train BERT on SQL, use pre-trained weights for English, and fine-tune on a year’s worth of requests. Et voilà!

Boss2SQL (patent pending). The encoder is a Bert model pre-trained on the English language (you can even use pre-trained weights!), the decoder a Bert model pre-trained on the SQL language. Fine-tune the model on a year’s worth of requests and you will never have to write a single line of SQL again.

Now imagine if we had a bank of BERTs pre-trained in many, many languages. Writing translators would become much easier, and thanks to transfer learning this would make the whole translation business easier to scale.

Encoder-decoder architectures could theoretically allow us to compound pre-training efforts to do transfer learning on a vast number of translation tasks.

HuggingFace 🤗❤️ Seq2Seq

When I joined HuggingFace, my colleagues had the intuition that the transformers literature would go full circle and that encoder-decoders would make a comeback. We thought that we should anticipate this move, and allow researchers to easily implement such models with our library.

Well, everything moves fast in NLP these days: within a few weeks BART and T5 were published; both are encoder-decoder architectures showcasing all sorts of new state-of-the-art results.

Enabling this integration was fairly straightforward. All we needed to do was modify the library so that the existing models (encoders) could also act as decoders. This meant:

  • Adding a cross-attention layer, whose weights are randomly initialized;
  • Transforming the attention mask on the decoder input into a left-to-right mask suited for generation tasks.
What happens schematically in our encoder-decoder architectures. The encoder has bi-directional layers of self-attention; the decoder is in fact the same model, to which we add layers of cross-attention and causal masks when it is used as a decoder. This allows us to leverage the models already implemented by the community with very little code.

🔧 Use encoder-decoder architectures to build amazing things 🔧

We defined a simple API that allows you to initialize encoder-decoders with pre-trained encoders and decoders. We call these hybrid pre-trained architectures the combiners:
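As a sketch of the idea, here is how a combiner can be built with the `EncoderDecoderModel` API found in recent transformers releases; class and method names may differ from the release described in this post, and tiny random-weight configs are used below only to avoid downloading checkpoints:

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# In practice you would pass pre-trained checkpoints directly, e.g.:
#   model = EncoderDecoderModel.from_encoder_decoder_pretrained(
#       "bert-base-uncased", "gpt2")

# Tiny illustrative configs (random weights, no downloads):
enc_cfg = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                     num_attention_heads=2, intermediate_size=64)
dec_cfg = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                     num_attention_heads=2, intermediate_size=64,
                     is_decoder=True, add_cross_attention=True)
config = EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
model = EncoderDecoderModel(config)
```

The decoder config flags (`is_decoder`, `add_cross_attention`) are what turn the encoder stack into a decoder: they enable the causal mask and the cross-attention layers described above.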

They allow you to combine, for instance, the NLU superpowers of BERT with the generation superpowers of GPT-2.

Thanks to transformers being central in the ecosystem and making state-of-the-art models available, encoder-decoder models benefit from a substantial compounding effect: 11 models implemented in the library means 121 possible combinations for you to start building cool things. When you account for all the different languages the numbers become astronomical.

The combiners are where the open-source philosophy of Hugging Face and its amazing community start to really shine.

Only need the superpowers of one model? No worries! We created a simpler API for you:

Knowing how to pass the arguments of the two models can be the (only) tricky step, so here is a reference you can use for your implementation:

To pass keyword arguments to the encoder and the decoder you need to respectively prefix them with `encoder_` and `decoder_`. Keyword arguments that are not prefixed will be passed to both models.
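As an illustration of this routing rule, here is a hypothetical helper (not the library's actual code) that splits keyword arguments the way described above:

```python
def split_kwargs(**kwargs):
    """Route `encoder_*` / `decoder_*` keyword arguments to the right
    sub-model; un-prefixed arguments go to both. Illustrative only."""
    encoder_kwargs, decoder_kwargs = {}, {}
    for name, value in kwargs.items():
        if name.startswith("encoder_"):
            encoder_kwargs[name[len("encoder_"):]] = value
        elif name.startswith("decoder_"):
            decoder_kwargs[name[len("decoder_"):]] = value
        else:  # shared kwarg: passed to both models
            encoder_kwargs[name] = value
            decoder_kwargs[name] = value
    return encoder_kwargs, decoder_kwargs

enc, dec = split_kwargs(encoder_attention_mask=1,
                        decoder_input_ids=2,
                        output_attentions=True)
```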

We recognize there are situations (notably for finetuning) in which you want to randomly initialize either the encoder or decoder. Easy:

Initialize an encoder-decoder model with a pre-trained BERT encoder and a randomly initialized GPT2 XL

Finally, if you want to share weights between the encoder and the decoder, you have access to both architectures via model.encoder and model.decoder. This is very application-specific, so we do not provide an API for it. Don’t hesitate to open an issue if you need help.

All of this has been available since the 2.2.0 release of the transformers library. For the moment, only BERT has been adapted to work as a decoder, but we’re working our way through the other models!

What combiner would you like most to play with? Let us know in the comments 👇 or ping us on Twitter @huggingface

⌨️ Generate text with Transformers ⌨️

When we started working on an illustrative example, we realized that the text generation capabilities of the library were limited (although we do have an awesome example script and an online demo of text generation). Since these capabilities are essential for Seq2Seq tasks, we started working on a simple module to let you generate sequences. The API is subject to change, but you should be able to generate text as in the following:

Sample sequences at various temperatures using k-filtering, nucleus sampling and applying repetition penalty.
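Here is a sketch of the usual sampling filters (temperature scaling, top-k filtering and nucleus/top-p sampling); repetition penalty is omitted for brevity, and this is illustrative rather than the library's implementation:

```python
import numpy as np

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature scaling, top-k and nucleus (top-p) filtering.
    Filtered-out logits are set to -inf so those tokens can never be drawn."""
    logits = np.array(logits, dtype=float) / temperature
    if top_k > 0:
        kth_best = np.sort(logits)[-top_k]
        logits[logits < kth_best] = -np.inf      # keep only the k best tokens
    if top_p < 1.0:
        order = np.argsort(logits)[::-1]          # tokens by decreasing score
        probs = np.exp(logits[order] - logits[order][0])
        probs /= probs.sum()
        cumulative = np.cumsum(probs)
        # Keep the smallest set of tokens whose cumulative probability >= top_p.
        cutoff = np.searchsorted(cumulative, top_p) + 1
        logits[order[cutoff:]] = -np.inf
    return logits

# Sampling then draws from softmax(filtered_logits).
filtered = filter_logits([3.0, 2.0, 1.0, 0.0], top_k=2)
```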

It will include, at the very least, sampling for both single-stack models (GPT, XLNet, CTRL, XLM, Transfo-XL, GPT-2) and encoder-decoder stacks. The following example of transformers playing exquisite corpse was generated using an early version of this module. Look at what 10 lines of code can do for you:

Transformers playing exquisite corpse, a game invented by surrealists in the 1930s. Each algorithm is given the sequence written by the previous one, leading to unexpected results.

Your GPU prefers beam search? We’ve got you covered:
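Schematically, beam search keeps the highest-scoring partial sequences at each step instead of sampling; here is a toy sketch (not the library's implementation), where `step_fn` stands in for the model's next-token log-probabilities:

```python
import math

def beam_search(step_fn, vocab, beam_size=2, max_len=3):
    """Toy beam search: keep the `beam_size` highest-scoring partial
    sequences at each step. `step_fn(seq)` returns log-probabilities
    over `vocab` for the next token given the sequence so far."""
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_fn(seq)
            for tok, lp in zip(vocab, log_probs):
                candidates.append((seq + [tok], score + lp))
        # Prune to the best `beam_size` hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# A degenerate "model" that always prefers token "a":
toy_step = lambda seq: [math.log(0.9), math.log(0.1)]
best_seq, best_score = beam_search(toy_step, ["a", "b"])[0]
```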

And this is only scratching the surface of what is possible in text generation.

If you would like to see more state-of-the-art methods to generate text in the library, let us know in the comments 👇 or ping us on Twitter @huggingface

📄 Abstractive summarization with Transformers 📄

Abstractive summarization has attracted a lot of attention lately in the research literature. We have also had a substantial amount of feedback from the community: users who are simply curious about the current state of the art, but also practitioners who would be happy to use it in their jobs.

We listened, so keep an eye on Twitter for the release 😉

At 🤗 HuggingFace we care deeply about the needs and aspirations of our community. What are the applications of Seq2Seq models that you find most interesting? Let us know in the comments 👇 or ping us on Twitter @huggingface
Source: huggingface
