
What are Transformers?

Today's machine learning literature is impossible to imagine without the transformer models introduced in the paper "Attention Is All You Need" (Vaswani et al. (2017)). Especially in the field of Natural Language Processing, the models first described in this paper (e.g. GPT-2 or BERT) have become indispensable. In this article, we want to explain the key points of this much-cited paper and show the innovations it made possible.

What are Transformers?

“To the best of our knowledge, however, the Transformer is the first transduction model relying
entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.”

Attention is all you need (Vaswani et al. (2017))

In plain English, this means that the Transformer model uses so-called self-attention to determine, for each word within a sentence, its relationship to the other words in the same sentence. Unlike previous approaches, this does not require Recurrent Neural Networks or Convolutional Neural Networks. To understand why this is so remarkable, we should first take a closer look at the areas in which transformers are used.

Where are transformers used?

Transformers are currently used primarily for translation tasks, such as those offered by common online translation services. In addition, these models are also suitable for other use cases within Natural Language Processing (NLP), such as question answering, text summarization, or text classification. The GPT-2 model is one implementation of transformers whose applications and results can be tried out online.

Self-attention using the example of a translation

As we have already noted, the big novelty of the paper by Vaswani et al. (2017) was the use of the so-called self-attention mechanism for textual tasks. That this is a major component of the models can be seen by looking at the general architecture of a transformer.

Schematic structure of a Transformer model, showing that a Transformer consists of a stack of self-attention layers.
Transformer – Model Architecture | Vaswani et al. (2017)

What this mechanism actually does and why it is so much better than the previous approaches becomes clear in the following example. For this purpose, the following German sentence is to be translated into English with the help of machine learning: 

„Das Mädchen hat das Auto nicht gesehen, weil es zu müde war.“ (In English: “The girl did not see the car because she was too tired.”)

Unfortunately, this task is not as easy for a computer as it is for us humans. The difficulty with this sentence is the little word “es”, which could theoretically stand for the girl or the car, although it is clear from the context that the girl is meant. And here’s the sticking point: context. How do we program an algorithm that understands the context of a sequence?

Before the publication of the paper “Attention is all you need”, so-called Recurrent Neural Networks were the state-of-the-art technology for such tasks. These networks process a sentence word by word. Before arriving at the word “es”, all preceding words must have been processed. As a result, by the time the algorithm reaches “es”, only little information about the word “Mädchen” (“girl”) is still present in the network’s state; the more recent words “weil” and “gesehen” dominate at that point. The problem is therefore that dependencies within a sentence are lost when the related words are far apart.
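This sequential processing can be sketched in a few lines. Note that the embeddings and weight matrices below are random and purely illustrative, not trained parameters: the point is only that the hidden state is updated one word at a time, so the influence of an early word like “Mädchen” gets squashed through many updates before “es” is reached.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each word is represented by a random embedding vector.
sentence = ["Das", "Maedchen", "hat", "das", "Auto", "nicht", "gesehen",
            "weil", "es", "zu", "muede", "war"]
dim = 8
embeddings = {w: rng.normal(size=dim) for w in set(sentence)}

# A plain RNN cell: the hidden state h is updated one word at a time,
# so information about early words must survive many tanh updates
# before the network finally processes "es".
W_h = rng.normal(size=(dim, dim)) * 0.1
W_x = rng.normal(size=(dim, dim)) * 0.1
h = np.zeros(dim)
for word in sentence:
    h = np.tanh(W_h @ h + W_x @ embeddings[word])

print("final hidden state after word-by-word processing:", h.round(3))
```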

What do Transformer models do differently? These algorithms process the complete sentence simultaneously rather than word by word. When the algorithm translates the word “es” in our example, it first passes through the so-called self-attention layer. This helps the model recognize which other words within the sentence could help to translate “es”. In our example, most of the words will receive a low attention value, while the word “Mädchen” (“girl”) will receive a high one. This way the context of the sentence is preserved in the translation.
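A minimal sketch of the scaled dot-product self-attention behind this idea could look as follows. The weight matrices are random here purely for illustration; in a trained Transformer they are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def self_attention(X):
    """Scaled dot-product self-attention over a sequence X (n_tokens x dim)."""
    n, d = X.shape
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)        # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# e.g. random vectors standing in for the 12 tokens of the example sentence
X = rng.normal(size=(12, 8))
out, attn = self_attention(X)

# Every token attends to every other token at once: each row of `attn`
# is a probability distribution over all 12 positions.
print(attn.sum(axis=1))
```

In a trained model, the row of `attn` belonging to “es” would place high weight on the position of “Mädchen”, which is exactly the context information an RNN struggles to carry over long distances.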

What are the different types of Transformers?

Transformers are a type of Machine Learning model that have gained popularity in recent years, particularly in the field of natural language processing. There are several types that are commonly used in Machine Learning, each with its own strengths and weaknesses.

  1. Transformer Model: The original Transformer model is an encoder-decoder architecture that was introduced in the seminal paper “Attention is All You Need” by Vaswani et al. (2017). The model is designed to process sequential data, such as text or time-series data, using a self-attention mechanism that allows it to focus on different parts of the input sequence. It has achieved state-of-the-art results in a number of natural language processing tasks, including machine translation and sentiment analysis.
  2. BERT: The BERT (Bidirectional Encoder Representations from Transformers) model was introduced in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Devlin et al. (2018). The BERT model is pre-trained on large amounts of text data using a masked language modeling objective, which allows it to learn contextual representations of words and sentences. It has achieved impressive results on a range of natural language processing tasks, including question-answering and natural language inference.
  3. GPT: The GPT (Generative Pre-trained Transformer) model is another type that was introduced in the paper “Improving Language Understanding by Generative Pre-Training” by Radford et al. (2018). The GPT model is also pre-trained on large amounts of text data but uses a generative language modeling objective to learn to generate coherent text. It has achieved impressive results on a range of natural language generation tasks, such as language translation and text completion.
  4. XLNet: The XLNet model is a transformer-based language model that was introduced in the paper “XLNet: Generalized Autoregressive Pretraining for Language Understanding” by Yang et al. (2019). XLNet uses a permutation-based pre-training method that allows it to model all possible orders of the input sequence. This method has been shown to outperform BERT on a range of natural language processing tasks.
  5. T5: The T5 (Text-to-Text Transfer Transformer) model is a transformer-based language model that was introduced in the paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” by Raffel et al. (2019). T5 is a versatile model that can be fine-tuned on a range of natural language processing tasks by simply providing the task as input and output pairs. T5 has achieved state-of-the-art results on a range of natural language processing tasks, including summarization, machine translation, and question-answering.
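The masked language modeling objective mentioned for BERT above can be sketched as follows. The 15% masking probability matches the BERT paper; the whitespace tokenization and the omission of BERT’s random-replacement refinements are simplifications for illustration.

```python
import random

random.seed(42)

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly hide tokens, as in BERT-style pre-training (simplified).

    Real BERT also sometimes keeps the original token or substitutes a
    random one; that refinement is omitted here.
    """
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)     # no prediction needed at this position
    return masked, labels

tokens = "the girl did not see the car because she was too tired".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

During pre-training, the model sees `masked` as input and is trained to recover the hidden tokens in `labels`, which forces it to build contextual representations of the surrounding words.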

Overall, the different types each have their own strengths and weaknesses and are suited to different types of Machine Learning tasks. Researchers continue to explore new applications and improvements to existing models, with the goal of improving the accuracy and efficiency of these powerful architectures.

LSTM and RNN vs. Transformer

The field of artificial intelligence is currently very fast-moving, which means that new findings are sometimes outdated and superseded very quickly. Just as LSTMs eliminated weaknesses of Recurrent Neural Networks, so-called Transformer models can deliver even better results than LSTMs.

Transformers differ fundamentally from previous models in that they do not process texts word by word but consider entire passages as a whole, which gives them a clear advantage in understanding context. The short- and long-term memory problems that LSTMs only partially solved therefore no longer arise: if the sentence is considered as a whole anyway, dependencies cannot be forgotten.

LSTM architecture

In addition, transformers are bidirectional in computation: when processing a word, they can include both the immediately following and the preceding words. Classical RNN or LSTM models cannot do this, since they work sequentially and thus only preceding words enter the computation. So-called bidirectional RNNs attempt to avoid this disadvantage, but they are considerably more computationally expensive than transformers.

However, bidirectional Recurrent Neural Networks still have a small advantage over transformers, because in a transformer the information is stored in so-called self-attention layers. With every additional token to be processed, this layer becomes harder to compute, since every token must be compared with every other token, and thus the required computing power increases. This increase in effort does not exist to the same extent in bidirectional RNNs, whose cost grows only linearly with the sequence length.
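This difference in scaling can be illustrated with a rough back-of-the-envelope count, assuming self-attention computes one score per token pair while an RNN takes one step per token:

```python
# Rough operation counts: self-attention scores grow quadratically with
# sequence length n (one score per token pair), while an RNN performs
# one sequential update per token.
for n in [10, 100, 1000]:
    attention_scores = n * n   # one score per token pair
    rnn_steps = n              # one sequential update per token
    print(f"n={n:>5}: attention pairs={attention_scores:>8}, rnn steps={rnn_steps}")
```

At 1,000 tokens, the attention layer already computes a million pairwise scores, while the RNN takes only a thousand steps, which is the trade-off described above.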

What are the limitations of a Transformer?

Transformer models have been successful in a wide range of natural language processing tasks. However, they also have limitations. Firstly, these models require large amounts of training data, making them difficult to apply in situations where data is scarce or expensive to obtain. Secondly, these models can be computationally intensive, particularly when fine-tuning them on specific tasks, making it difficult to train large models on standard hardware.

Thirdly, transformer models can be challenging to interpret, especially when used for complex tasks such as natural language generation. As a result, it can be difficult to understand how the model is making its predictions or to identify sources of errors or biases. Fourthly, while transformer models have achieved impressive results on a range of natural language processing tasks, there is still a need for models that can generalize to new domains or languages. Current models may struggle to adapt to new data distributions or to handle languages with limited training data.

Finally, transformer models can be vulnerable to biases in the training data, leading to biased predictions. This can raise concerns about fairness and equity. Therefore, there is a need for techniques to identify and mitigate bias in transformer models to ensure fair and equitable outcomes.

In summary, while transformer models have shown significant advances in natural language processing, it is important to consider their limitations and develop new techniques to address these challenges.

This is what you should take with you

  • Transformers enable new advances in the field of Natural Language Processing.
  • Transformers use so-called attention layers. This means that all words in a sequence are used for the task, no matter how far apart the words are in the sequence.
  • They replace Recurrent Neural Networks for such tasks.

Other Articles on the Topic of Transformer Models

  • The original paper is “Attention is All You Need” by Vaswani et al. (2017).
