
Transformers enter the stage

In today’s machine learning literature, there is no way around the Transformer models introduced in the paper “Attention is all you need” (Vaswani et al. (2017)). Especially in the field of Natural Language Processing, the models that build on the architecture first described in this paper (e.g. GPT-2 or BERT) are indispensable. In this article, we want to explain the key points of this much-cited paper and show the resulting innovations.

What are Transformers?

“To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.”

Attention is all you need (Vaswani et al. (2017))

In plain English, this means that the Transformer model uses so-called self-attention to determine, for each word in a sentence, how it relates to the other words in the same sentence. Unlike earlier approaches, this does not require Recurrent Neural Networks or Convolutional Neural Networks. To understand why this is so remarkable, we should first take a closer look at the areas in which Transformers are used.
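The idea of relating every word to every other word can be sketched in a few lines of plain Python. This is a simplified illustration, not the full layer from the paper: it assumes that each word vector serves directly as query, key, and value, whereas the paper applies learned projections first. The 2-dimensional word vectors and the function names are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Simplified self-attention: each vector is used as query, key, and value."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this word with every word in the sentence (including itself).
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted average of all word vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional word vectors for a three-word sentence.
sentence = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(sentence)
```

Each row of `weights` sums to 1 and states how strongly one word attends to every word in the sentence. Words with similar vectors receive higher weights for each other.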

Where are transformers used?

Transformers are currently used primarily for translation tasks. In addition, these models are also suitable for other use cases within Natural Language Processing (NLP), such as question answering, text summarization, or text classification. The GPT-2 model is one implementation of the Transformer whose applications and results can be tried out online.

Self-attention using the example of a translation

As we have already noted, the big novelty of the paper by Vaswani et al. (2017) was relying entirely on the so-called self-attention mechanism for textual tasks. That this mechanism is a major component of the models can be seen by looking at the general architecture of the Transformer.

Figure: schematic structure of a Transformer model. It shows that a Transformer consists of a large number of self-attention layers.
Transformer – Model Architecture

What this mechanism actually does, and why it improves on the previous approaches, becomes clear in the following example. Suppose the following German sentence is to be translated into English with the help of machine learning:

„Das Mädchen hat das Auto nicht gesehen, weil es zu müde war.“

Unfortunately, this task is not as easy for a computer as it is for us humans. The sentence means “The girl did not see the car because she was too tired.” The difficulty is the little word “es” (it), which could grammatically stand for either the girl or the car, since both nouns are neuter in German, although it is clear from the context that the girl is meant. And here’s the sticking point: context. How do we program an algorithm that understands the context of a sequence?

Before the publication of the paper “Attention is all you need”, so-called Recurrent Neural Networks were the state-of-the-art technology for such tasks. These networks process a sentence word by word, so all previous words must have been processed before the network arrives at the word “es”. By that point, only a little information about the word “Mädchen” (girl) is still available in the network, while the more recent words “weil” and “gesehen” are represented much more strongly. The problem is that dependencies within a sentence are lost if the words involved are very far apart.
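The fading of early words can be illustrated with a deliberately simple toy model. The sketch below assumes a linear recurrence, h_t = decay * h_(t-1) + x_t, which is far simpler than a real Recurrent Neural Network with learned weights and nonlinearities. The function name and the decay value of 0.5 are hypothetical and chosen only for illustration.

```python
def contribution_to_final_state(n_words, decay):
    """Factor by which each position's input is scaled in the last hidden state.

    Toy linear recurrence: h_t = decay * h_(t-1) + x_t.
    """
    return [decay ** (n_words - 1 - position) for position in range(n_words)]

# The sentence up to and including the word "es".
words = ["Das", "Mädchen", "hat", "das", "Auto", "nicht", "gesehen",
         "weil", "es"]

# decay = 0.5 is an arbitrary illustrative value, not a measured one.
shares = contribution_to_final_state(len(words), decay=0.5)

for word, share in zip(words, shares):
    print(f"{word:8s} {share:.4f}")
```

In this toy setting, the factor for “Mädchen” is much smaller than the factors for “gesehen” and “weil”, because it has been multiplied by the decay value once for every word that followed it.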

What do Transformer models do differently? They process the complete sentence at once instead of proceeding word by word. When the model translates the word “es” in our example, the word first passes through the so-called self-attention layer, which helps the model identify the other words in the sentence that are relevant for translating “es”. In our example, most words in the sentence will receive a low attention value, while the word “Mädchen” will receive a high one. This way, the context of the sentence is preserved in the translation.
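To make the attention values for “es” more tangible, here is a small sketch in plain Python. The relevance scores are hand-picked and hypothetical; a trained model would compute them from learned word representations. The sketch only shows how raw scores are turned into attention weights that sum to 1.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = ["Das", "Mädchen", "hat", "das", "Auto", "nicht", "gesehen",
         "weil", "es", "zu", "müde", "war"]

# Hand-picked, hypothetical relevance scores of each word for the query "es".
scores = [0.1, 4.0, 0.2, 0.1, 1.5, 0.3, 0.5, 0.4, 1.0, 0.2, 2.0, 0.3]

weights = softmax(scores)
best_word = words[weights.index(max(weights))]

for word, weight in zip(words, weights):
    print(f"{word:8s} {weight:.3f}")
```

With these assumed scores, “Mädchen” receives the largest attention weight, regardless of how many words lie between it and “es”.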

This is what you should take with you

  • Transformers enable new advances in the field of Natural Language Processing.
  • Transformers use so-called attention layers. This means that all words in a sequence are used for the task, no matter how far apart the words are in the sequence.
  • Transformers replace Recurrent Neural Networks for such tasks.

Other Articles on the Topic of Transformer Models

  • The original paper is “Attention is all you need” (Vaswani et al. (2017)).