Examining the Transformer Architecture

Parts 1 and 2 of this blog were written by Nityesh Agarwal in collaboration with, and sponsored by, Exxact Corporation.

Part 1: The OpenAI GPT 2 Controversy

“Recycling is NOT good for the world. It is bad for the environment, it is bad for our health, and it is bad for our economy. I’m not kidding. Recycling is not good for the environment. It is destructive to the earth and it is a major contributor to global warming. Recycling is not good for our health. It contributes to obesity and diseases like heart disease and cancer. Recycling is bad for our economy. It increases the cost of a product, and in turn, the price of everything that is made with that product. Recycling is not good for our nation. We pay a tremendous price for the privilege of having the world’s most advanced and efficient recycling system. Recycling is a huge, colossal waste of time, energy, money, and resources.”— GPT 2 model from OpenAI

No, these aren’t the words of some anti-recycling freak trying to popularize an anti-recycling manifesto on social media. The whole thing was written by a language model… an algorithm… an AI, if we are being crass.

GPT-2 is a generative model, created by OpenAI, trained on 40GB of Internet text to predict the next word. OpenAI found the model to be SO good that they did not release the fully trained version (which some say goes against their founding motto of making AI open to all), citing their “concerns about malicious applications of the technology”.

What Malicious Applications Are We Talking About?

OpenAI says, “We can also imagine the application of these models for malicious purposes, including the following (or other applications we can’t yet anticipate):

  • Generate misleading news articles
  • Impersonate others online
  • Automate the production of abusive or faked content to post on social media
  • Automate the production of spam/phishing content”

“But Is This Model REALLY That Good?”

Or are the above concerns merely the opinions of some paranoid people?

This model did not present some big leap on the algorithmic front. It is just a larger version of the GPT model that was published by the team months ago.

What it did do was show how capable our current language modeling techniques are at text generation. It is a massively scaled-up version of its predecessor: GPT-2 has a whopping 1.5 billion parameters (10X more than the original GPT) and is trained on text from 8 million web pages.

You can understand the feat of this model once you compare it with other “popular” generative language models.

But First An Aside — A Very Simple Explainer On Language Models

Language models aim to represent the history of observed text succinctly in order to predict the next word. So a language model is basically just learning to predict words. Give the model a prompt and it will predict the next word, then the next, then the one after that; pretty soon it will have formed a meaningful sentence. Combine enough of those and you have a coherent paragraph, and then... well, pretty much anything you want.
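To make “learning to predict words” concrete, here is a minimal sketch of a toy bigram language model in Python. It is nothing like GPT-2 in scale or architecture: it only counts which word tends to follow which. The tiny corpus and the function names (train_bigram, generate) are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_words=10):
    """Greedily extend the prompt by repeatedly predicting the next word."""
    out = prompt.lower().split()
    for _ in range(max_words):
        followers = counts.get(out[-1])
        if not followers:              # unseen word: nothing to predict
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# Hypothetical toy corpus, purely for illustration.
corpus = "the cat sat on the mat . the cat sat on the sofa ."
model = train_bigram(corpus)
print(generate(model, "the", max_words=3))
```

A real language model replaces the lookup table with a neural network and conditions on far more than the previous word, but the generation loop (predict a word, append it, repeat) has the same shape.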

For example, just watch this sci-fi short film released in mid-2016 whose script was created by a generative model using LSTM architecture trained on the scripts of a lot of sci-fi movies and tv shows:

They got Thomas Middleditch — Silicon Valley’s Richard — to star in this!!

Or how about this AI-generated Harry Potter fanfiction that shot to popularity at the end of 2017:

As you can see, both of these are far inferior in quality to the GPT-2 example. OpenAI has cherry-picked and published a few more samples in their blog post, Better Language Models and Their Implications.

The “unicorn” sample reads like a real science press release. The “theft of nuclear material” sample reads like a real news story. The “Miley Cyrus shoplifting” sample reads like a real post from a celebrity gossip site. The “GPT-2” sample reads like a real OpenAI press release. The “Legolas and Gimli” sample reads like a real fantasy novel. The “Civil War homework assignment” reads like a real C-student’s paper. The “JFK acceptance speech” reads like a real politician’s speech. The “recycling” sample reads like a real right-wing screed.

And it’s not just these 8 cherry-picked examples. OpenAI also provides a dump of hundreds of raw GPT-2 samples, which gives a clearer picture of the model’s capabilities.

Yes, each of them “reads like” real human-generated content. But none of them is.

This article argues that “if you skim text, you miss obvious absurdities. The point is OpenAI HAS achieved the ability to pass the Turing test against humans on autopilot.” So if you aren’t really concentrating and are just skimming through, you won’t be able to spot that the text was generated by a language model. That is definitely not true of the other examples I presented above. Even “reading like” normal human-generated content is a big feat.

So, yeah. I would say that this model is really good.

The Effect Of Not Releasing The Weights Of The Model

The fact that OpenAI did not release the model came as a huge shock to the AI community and the media.

Some people argue that this was just a publicity stunt on OpenAI’s part, because there was no algorithmic feat. Another group believes that the attempt is futile: the code is open-sourced, so big companies, or anyone willing to spend enough on compute resources, will be able to replicate the results in just a few months’ time.

But there is another group of people who applaud OpenAI for trying to make researchers aware of the effects of their research. As AI techniques become more and more powerful, the world will face the important challenge of fighting synthetic content and misinformation.

Soumith Chintala is the creator of PyTorch. This is a thread between him, Jack Clark and Jeremy Howard!

So wouldn’t it be cool to know how it works, to know the algorithm that powers it?

Part 2: A Brief Description of How Transformers Work

The Transformer Architecture

This architecture was first proposed in the seminal paper “Attention Is All You Need”, from Google, in mid-2017.
In the short time since, this architecture has been used to produce state-of-the-art results in two lines of work: GPT/GPT-2 and BERT.

Table Source: Language Models are Unsupervised Multitask Learners, Radford et al. 2019

The smallest one corresponds to the GPT model; the second smallest one is equivalent to the largest model in BERT; the largest one, which is more than an order of magnitude larger, corresponds to the GPT-2 model

Now let’s look at the architecture:

The Transformer architecture as presented in the Attention Is All You Need paper by Google

The first thing that we can see is that it has a sequence-to-sequence encoder-decoder architecture. Much of the literature on Transformers on the Internet uses this very architecture to explain them. But this is not the one used in OpenAI’s GPT model (or the GPT-2 model, which is just a larger version of its predecessor).

GPT is a 12-layer, decoder-only Transformer with 117M parameters.

Improving Language Understanding by Generative Pre-Training, Radford et al.

The Transformer architecture used in the GPT paper from OpenAI

GPT (and the smaller released version of GPT-2) has 12 Transformer layers, each with 12 independent attention mechanisms, called “heads”; the result is 12 x 12 = 144 distinct attention patterns. Each of these attention patterns can capture a different linguistic property.
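The arithmetic behind these figures can be checked with a rough back-of-the-envelope sketch. The 12 layers and 12 heads come from the text above; the hidden size (768), vocabulary size (40,478), context length (512), 4x feed-forward width and tied input/output embeddings are assumptions taken from the original GPT paper, not from this article, and the count_params helper is purely illustrative.

```python
# Figures stated in the text.
n_layers = 12
n_heads = 12

# Assumptions NOT stated in this article (taken from the original GPT paper).
d_model = 768        # hidden size
vocab_size = 40478   # BPE vocabulary size
n_positions = 512    # maximum context length

def count_params(n_layers, d_model, vocab_size, n_positions):
    """Rough parameter count for a decoder-only Transformer with tied embeddings."""
    attention = 4 * (d_model * d_model + d_model)    # Q, K, V and output projections, with biases
    d_ff = 4 * d_model                               # assumed feed-forward width
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * 2 * d_model                    # two layer norms, each with scale and shift
    per_layer = attention + feed_forward + layer_norms
    embeddings = vocab_size * d_model + n_positions * d_model
    return n_layers * per_layer + embeddings

attention_patterns = n_layers * n_heads
head_dim = d_model // n_heads
total = count_params(n_layers, d_model, vocab_size, n_positions)

print(attention_patterns)                  # number of (layer, head) pairs
print(head_dim)                            # size of each head's slice of the hidden state
print(f"{total / 1e6:.1f}M parameters")    # should land close to the quoted 117M
```

Under these assumptions the estimate lands within about one percent of the quoted 117M, with most of the parameters sitting in the per-layer attention and feed-forward weights rather than in the embeddings.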

As we can see in the above transformer architectures, attention is an important part of the Transformer. In fact, that would be an understatement. Attention is what makes the transformer work. So, let’s get a brief introduction to attention.

Attention Model

RNN units encode the input up to timestep t into one hidden vector h_t, which is then passed to the next timestep (or to the decoder, in the case of a sequence-to-sequence model). With an attention mechanism, we no longer try to encode the full source sentence into a fixed-length vector. Rather, we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to, based on the input sentence and what it has produced so far.

So let’s say we wanted to translate “L’accord sur la zone économique européenne a été signé en août 1992.” (French) into English, which is “The agreement on the European Economic Area was signed in August 1992.”

The graph below shows what an Attention model learned to attend to for each word of translation that it generated.

Image Source: jalammar.github.io

Notice how it is mostly linear, except when translating “zone économique européenne” to “European Economic Area”. In that case it correctly attends to the source words in reverse order.

Such an ability allows attention to learn long-range dependencies.
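As a rough illustration of what “attending” means numerically, here is a minimal NumPy sketch of scaled dot-product attention, the variant used inside Transformers. Note that this is not the exact mechanism behind the translation example above, and the random vectors and sizes are hypothetical stand-ins for learned representations.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: (n_out, d), keys: (n_src, d), values: (n_src, d_v).
    Returns the attended outputs (n_out, d_v) and the weights (n_out, n_src).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # similarity of each output step to each source word
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
n_src, n_out, d = 5, 4, 8                    # hypothetical sizes
keys = values = rng.normal(size=(n_src, d))  # stand-in for encoded source words
queries = rng.normal(size=(n_out, d))        # stand-in for decoder states

context, weights = attention(queries, keys, values)
print(weights.shape)   # one row of attention weights per output step
print(context.shape)   # one attended summary vector per output step
```

Each row of weights plays the role of one row in the attention graph: it says how much each source word contributes to that output step, regardless of how far apart the two positions are.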

Comparison with RNN

Some practitioners believe that we are now witnessing the fall of RNNs/LSTMs. Since rising to prominence around 2014, they have been the default go-to architecture for NLP tasks including language modelling, machine translation, text summarization, image/video captioning, speech-to-text conversion and more.

But RNN and its variations had 2 major shortcomings:

  1. Failure to remember long-range dependencies

One of the primary appeals of RNNs is that they can use information about earlier parts of a sequence to inform later ones. But this also turns out to be one of their major shortcomings.

RNNs need to encode the information from the entire sequence in one single context vector.

The decoder is supposed to generate a translation solely based on the last hidden state (h3) from the encoder. This vector must encode everything we need to know about the source sentence. It must fully capture its meaning.

As the gap between 2 words grows, the RNN seems to “forget” the previous words.

Long Short-Term Memory units (LSTMs) and Gated Recurrent Units (GRUs) offer a workaround for this problem: they use memory cells controlled by gates, which allow them to retrieve information from further back in the sequence.

  2. Inability to harness the power of GPUs

RNNs aren’t able to process inputs in parallel. They are networks with loops in them, allowing information to persist.

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

In the above diagram, a chunk of neural network, A, looks at some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next. This means that an RNN can only handle one input unit at a time.

That is why they can’t make use of the parallel computing ability of immensely powerful GPUs. Such GPUs have allowed CNNs to train on a HUGE amount of data and grow to absolutely massive sizes. RNNs, LSTMs and their variants are inherently ill-suited to exploiting this parallelism.

Transformers address both of these shortcomings.
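The difference in parallelism can be sketched in a few lines of NumPy. The vanilla RNN below has to walk through the sequence one step at a time because each hidden state depends on the previous one, whereas a (simplified, projection-free) self-attention layer handles every position with a couple of matrix products. The sizes and random weights are hypothetical and the functions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4                        # hypothetical sizes
x = rng.normal(size=(seq_len, d))        # one input vector per time step
W_x = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1

def rnn_forward(x, W_x, W_h):
    """Vanilla RNN: h_t depends on h_{t-1}, so the loop cannot be parallelised over t."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:                        # strictly one time step after another
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

def self_attention_forward(x):
    """Self-attention without learned projections: all positions are computed at once."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rnn_states = rnn_forward(x, W_x, W_h)
attn_out = self_attention_forward(x)
print(rnn_states.shape, attn_out.shape)
```

Both functions return one vector per position, but only the second is expressed purely as matrix operations, which is the kind of workload GPUs are built for.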

Deep Learning Workstations from Exxact featuring Quadro RTX 8000s are well suited to training even large transformer models. Each Quadro RTX 8000 has 48 GB of GPU memory, and a pair can be connected with NVLink to give 96 GB of total GPU memory to fit massive transformer models.

Pre-Trained Language Models — Transfer Learning in NLP

The Transformer architecture allows the creation of NLP models trained on absolutely huge datasets, as we saw in this article. Training such models is not feasible for everyone, just as you wouldn’t expect to train a VGG Net from scratch on the ImageNet dataset. Hence the era of pre-trained language models.

The weights learned by such massive pre-trained models can later be reused for specific tasks by fine-tuning them on a task-specific dataset. This lets us do transfer learning: the pre-trained model has already captured the lower-level intricacies of the language, and we simply “plug” it into our specific task.
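The simplest form of this idea is to freeze the pre-trained weights and train only a small new “head” on top of them (full fine-tuning would also update the pre-trained weights). The NumPy sketch below illustrates that pattern under heavy assumptions: the “pre-trained” matrix is just random numbers standing in for real learned weights, and the dataset and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for weights learned during large-scale pre-training (random here, purely hypothetical).
pretrained_W = rng.normal(size=(10, 16))

def encode(x):
    """Frozen 'pre-trained' feature extractor: its weights are never updated below."""
    return np.tanh(x @ pretrained_W)

# Hypothetical small labelled dataset for the downstream task.
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(float)

features = encode(X)
w, b = np.zeros(16), 0.0                 # new task-specific head, trained from scratch

def loss(w, b):
    """Mean logistic (cross-entropy) loss of the head on the frozen features."""
    p = 1 / (1 + np.exp(-(features @ w + b)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

initial_loss = loss(w, b)
for _ in range(500):                     # plain gradient descent on the logistic loss
    p = 1 / (1 + np.exp(-(features @ w + b)))
    grad = p - y
    w -= 0.1 * features.T @ grad / len(y)
    b -= 0.1 * grad.mean()
final_loss = loss(w, b)

print(initial_loss, final_loss)          # the loss should drop as the head is trained
```

Only the 17 parameters of the head are trained here; everything the encoder “knows” is reused as-is, which is why this kind of reuse is so much cheaper than training from scratch.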

Transformers present the next front in NLP. In less than a couple of years since its introduction, this new architectural trend has surpassed the feats of RNN-based architectures. This exciting pace of invention is perhaps the best part of being early to a new field like Deep Learning!

Part 3: Training a Transformer Network from Scratch in Docker

Training for this tutorial will be done on our Exxact Valence Workstation using an NVIDIA RTX 2080 Ti. We will create an English-to-German translator using the transformer model implementation in the official TensorFlow GitHub repository. Assuming you have met all the necessary dependencies for TensorFlow GPU, this is a simple guide to getting started with transformers in Docker.

Step 1) Launch TensorFlow GPU Docker Container

Using Docker allows us to spin up a fully contained environment for our training needs. We always recommend using Docker, as it allows ultimate flexibility (and forgiveness) in our training environment. To begin we will open a terminal window and enter the following command to launch our NVIDIA CUDA powered container.

nvidia-docker run -it -p 6007:6006 -v /data:/datasets tensorflow/tensorflow:nightly-gpu bash

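Note: A quick description of the key parameters of the above command (if you’re unfamiliar with Docker):

  • -it runs the container interactively with a terminal attached
  • -p 6007:6006 maps port 6007 on the host to port 6006 inside the container
  • -v /data:/datasets mounts the host folder /data at /datasets inside the container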

Step 2) Install git

This may be necessary if you are running a fresh docker container.

apt-get update && apt-get install -y git

Step 3) Download TensorFlow Models

In case you do not have the latest codebase for the models, clone the TensorFlow models repository. The transformer is included in it, and the repository tends to be updated quite frequently.

git clone https://github.com/tensorflow/models.git

Step 4) Install Requirements

This step installs the Python package requirements for training TensorFlow models. Run it from the root of the cloned models folder.

pip install --user -r official/requirements.txt

Step 5) Export Pythonpath

Export PYTHONPATH to the folder where the models folder is located on your machine. The command below references where the models are located on our system. Be sure to replace ‘/datasets/datasets/models’ with the path to the folder where you stored/downloaded your models.

export PYTHONPATH="$PYTHONPATH:/datasets/datasets/models"

Step 6) Download and Preprocess the Dataset

The data_download.py script will download and preprocess the training and evaluation WMT datasets. After download and extraction, the training data is used to generate the vocabulary file that we will later reference through the VOCAB_FILE variable. The evaluation and training strings are then tokenized, and the results are saved as TFRecords.

NOTE: (per the official requirements): 1.75GB of compressed data will be downloaded. In total, the raw files (compressed, extracted, and combined files) take up 8.4GB of disk space. The resulting TFRecord and vocabulary files are 722MB. The script takes around 40 minutes to run, with the bulk of the time spent downloading and ~15 minutes spent on preprocessing.

python data_download.py --data_dir=/datasets/datasets/transformer

Step 7) Set Training Variables


‘PARAM_SET’ Model to train

This specifies which model to train: ‘big’ or ‘base’.

IMPORTANT NOTE: The ‘big’ model will not work on most consumer-grade GPUs such as the RTX 2080 Ti or GTX 1080 Ti. If you need to train the ‘big’ model, we recommend a system with at least 48 GB of available GPU memory, such as a Data Science Workstation equipped with Quadro RTX 8000s, or 2 x Quadro RTX 6000 with NVLink. Alternatively, a TITAN RTX Workstation with 2x TITAN RTX (with NVLink Bridge) should also suffice. For this example, we’re using an RTX 2080 Ti, so we select ‘base’.

‘DATA_DIR’ Training data location

This variable should be set to where the training data is located.

‘MODEL_DIR’ Model location

This variable specifies the model location, based on the model specified in the ‘PARAM_SET’ variable.

‘VOCAB_FILE’ Vocabulary file location

This variable specifies where the preprocessed vocab files are located.

‘EXPORT_DIR’ Export trained model

This specifies where the model is exported in TensorFlow SavedModel format. The export happens when the export_dir flag is passed during training in step 8.


Step 8) Train the Transformer Network

The following command, ‘python transformer_main.py’, will train the transformer for a total of 260,000 steps. Note how the flags reference the variables you set in the previous steps. You can train for fewer than 260,000 steps; it’s up to you.

NOTE: Training will take a long time, depending on your GPU resources. The official TensorFlow transformer model is under constant development, so be sure to check their GitHub periodically for optimizations and techniques to reduce training times.

python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET --bleu_source=test_data/newstest2014.en --bleu_ref=test_data/newstest2014.de --train_steps=260000 --steps_between_evals=1000 --export_dir=$EXPORT_DIR
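The way each flag picks up one of the Step 7 variables can be sketched in Python. This is only an illustration of the mapping, not part of the official scripts: the build_train_command helper is made up, and apart from PARAM_SET (‘base’) and the data directory used in Step 6, every path below is a hypothetical placeholder that you must replace with your own.

```python
import shlex

# Placeholder values: only PARAM_SET and DATA_DIR come from this tutorial;
# the other paths are hypothetical and must be replaced with your own.
settings = {
    "PARAM_SET": "base",
    "DATA_DIR": "/datasets/datasets/transformer",
    "MODEL_DIR": "/path/to/model_dir",
    "VOCAB_FILE": "/path/to/vocab_file",
    "EXPORT_DIR": "/path/to/export_dir",
}

def build_train_command(env, train_steps=260000, steps_between_evals=1000):
    """Assemble the Step 8 call as an argument list, one flag per variable."""
    return [
        "python", "transformer_main.py",
        f"--data_dir={env['DATA_DIR']}",
        f"--model_dir={env['MODEL_DIR']}",
        f"--vocab_file={env['VOCAB_FILE']}",
        f"--param_set={env['PARAM_SET']}",
        "--bleu_source=test_data/newstest2014.en",
        "--bleu_ref=test_data/newstest2014.de",
        f"--train_steps={train_steps}",
        f"--steps_between_evals={steps_between_evals}",
        f"--export_dir={env['EXPORT_DIR']}",
    ]

command = build_train_command(settings)
print(shlex.join(command))   # inspect the command before running it
```

Printing the command first makes it easy to spot an unset or mistyped path before committing to a long training run; lowering train_steps here is also a convenient way to try a shorter run.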

Step 9) View Results in Tensorboard

We can check the status of training in the TensorBoard GUI. To check in real time, run the following command in a separate terminal (or TensorFlow container), and type localhost:6007 in your browser to view TensorBoard. You can also wait until training is complete and use the current container.

tensorboard --logdir=$MODEL_DIR

You should see some outputs of the training similar to below.

Step 10) Test the Trained Model (Translate English to German)

Now that we’ve trained our network, let’s enjoy the fruits of our labor using translate.py! In the command below, replace the text “hello world” with the text you want to translate.

python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET --text="hello world"

Output of above command:

I0411 18:05:23.619654 139653733598976 translate.py:150] Translation of “hello world”: “Hallo Welt”

Final Thoughts

We’ve taken a look at transformer networks and at how and why they are so effective. They are currently the state-of-the-art architecture and an active area of NLP research. You should also now have a general idea of what it takes to train a transformer network. For a deeper dive into training transformers, visit the official transformer implementation in the TensorFlow GitHub repo. We hope you’ve enjoyed this blog series. Now get out there and build something awesome!

Originally published at https://blog.exxactcorp.com on May 29, 2019.

Examining the Transformer Architecture — Part 1: The OpenAI GPT 2 Controversy was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.