LLM Series: Understanding Transformer Architecture through the Attention Paper
Introduction
This article is a technical memo based on reading the following paper:
- Paper Title: Attention Is All You Need
- Paper Link: https://arxiv.org/abs/1706.03762
- Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- Initial Publication Date: June 12, 2017
Detailed background explanations, illustrations, and supplementary information on Self-Attention and Multi-Head Attention are summarized on my personal blog.
👉 See the full version here: What is Transformer? A gentle explanation of the basic structure of LLMs from the Attention Is All You Need paper
3-Line Summary

- The Transformer is a model that performs sequence-to-sequence transformation centered on Attention, without using RNNs (neural networks that process sequences recursively) or CNNs (neural networks that look at local patterns via convolution).
- The paper combines Multi-Head Attention, Position-wise Feed-Forward Networks, and Positional Encoding, demonstrating high performance and learning efficiency in machine translation.
- GPT-style LLMs do not use the paper's architecture as-is, but the paper is a good place to pick up the basic idea: vectorize tokens, add positional information, and build contextual representations through Attention layers.
What Kind of Paper Is This?
"Attention Is All You Need" is the paper that proposed the Transformer.
The original Transformer in the paper features an Encoder-Decoder structure designed for machine translation.
The encoder reads the input sentence, and the decoder generates the translated sentence one token at a time.
| Component | Role |
|---|---|
| Input Embedding | Converts token IDs into vectors |
| Positional Encoding | Adds positional information to tokens |
| Encoder Self-Attention | Looks at relationships between tokens in the input sentence |
| Decoder Masked Self-Attention | Attends over the output-side tokens while masking future positions |
| Encoder-Decoder Attention | Allows the decoder to reference the encoder's output |
| Feed Forward Network | Performs non-linear transformations on the representation at each position |
Within this LLM series, the previous article on Bahdanau Attention focused on the idea of "where the decoder looks in the input."
In this Transformer paper, that Attention is placed at the center of sequence processing rather than acting as an auxiliary component.
What Is New?
The major breakthrough is the removal of recursion and convolution from sequence processing, shifting to an Attention-centered structure.
Since RNNs process tokens sequentially, parallelizing computation for long sequences is difficult.
While CNNs are relatively easier to parallelize, they require stacking layers to capture relationships between distant positions.
Transformer's Self-Attention (a mechanism where tokens in the same sequence reference each other) makes it easier to directly connect distant tokens within a single layer.
| Aspect | RNN | CNN | Transformer Self-Attention |
|---|---|---|---|
| Parallelization | Limited along the time axis | Relatively easy | Very easy |
| Long-range dependency | Longer paths for distant tokens | Dependent on the number of layers | Directly referenceable in one layer |
| Positional information | Included in sequential processing | Included in convolutional positions | Explicitly added via Positional Encoding |
Understanding Self-Attention

The basic form used in the Transformer paper is Scaled Dot-Product Attention.
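For reference, the paper writes this as a single formula, where $Q$, $K$, and $V$ are the matrices of stacked queries, keys, and values, and $d_k$ is the dimension of the keys:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

The division by $\sqrt{d_k}$ keeps the dot products from growing with the dimension, which would otherwise push softmax into regions with very small gradients.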
Query, Key, and Value are created from the vector representations of input tokens using separate weight matrices.
For example, suppose we want to update the representation of "love" in the sentence "I love machine learning."
Here, the Query created from "love" represents "what I want to reference."
On the other hand, the Keys created from each token, such as "I," "machine," "learning," and "love," serve as "clues to be matched."
The larger the dot product between the Query and a token's Key, the higher that token's attention weight after softmax.
Finally, the Values of all tokens are combined as a weighted sum using those weights.
It is helpful to think of the Value as "the token's vector representation transformed by a Value-specific weight matrix."
| Role | Example of updating "love" |
|---|---|
| Query | Created from "love." What to reference |
| Key | Created from each token. Clues to be matched with the Query |
| Value | Created from each token's representation. Content to be mixed based on weights |
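The Q/K/V roles described here can be sketched in a few lines. This is a minimal single-head illustration, not the paper's implementation: it uses NumPy instead of PyTorch to stay self-contained, the toy sizes (`d_model = 8`, `d_k = 4`) and the random matrices `w_q`, `w_k`, `w_v` are assumptions standing in for learned weights, and the function name is made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over token vectors x of shape (n_tokens, d_model)."""
    q = x @ w_q  # queries: what each token wants to reference
    k = x @ w_k  # keys: clues matched against the queries
    v = x @ w_v  # values: content that gets mixed
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # (n_tokens, n_tokens) affinities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = ["I", "love", "machine", "learning"]
d_model, d_k = 8, 4  # toy sizes, far smaller than the paper's
x = rng.normal(size=(len(tokens), d_model))  # stand-in for embedded tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(x, w_q, w_k, w_v)
love = tokens.index("love")
# Attention weights used to update "love": one weight per token in the sentence
print({t: round(float(w), 3) for t, w in zip(tokens, weights[love])})
```

The row of `weights` for "love" is exactly the set of mixing coefficients from the table: the new representation of "love" is that row multiplied by the Value matrix.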
Understanding Multi-Head Attention

Multi-Head Attention is a mechanism that runs multiple Attention heads in parallel.
A head is a unit consisting of a separate set of projections to create Q/K/V and the Attention calculation using them.
It is somewhat similar to the types of filters in Convolutional Neural Networks, processing the same input from multiple perspectives.
However, while convolutional filters look at nearby elements locally, Attention heads can directly reference distant tokens as well.
In the base model of the paper, $d_{\text{model}} = 512$, $h = 8$, and $d_k = d_v = d_{\text{model}} / h = 64$.
In other words, each token is treated as a 512-dimensional representation, which is split into 8 heads to calculate Attention with 64 dimensions each.
The initial 512-dimensional vector is created by Input Embedding, added with Positional Encoding, and passed to the Transformer layer.
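The split into heads is mostly a matter of reshaping. Below is a rough NumPy sketch using the base model's sizes (512 dimensions, 8 heads, 64 dimensions per head). The weights are random placeholders rather than trained parameters, biases and masking are left out, and the function and variable names are assumptions made for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    n_tokens, d_model = x.shape
    head_dim = d_model // num_heads

    def split_heads(t):
        # (n_tokens, d_model) -> (num_heads, n_tokens, head_dim)
        return t.reshape(n_tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    weights = softmax(scores, axis=-1)                      # one attention map per head
    heads = weights @ v                                     # (heads, n, head_dim)
    # Concatenate the heads back into d_model dimensions, then mix them with w_o
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ w_o, weights

d_model, num_heads, n_tokens = 512, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) * d_model ** -0.5 for _ in range(4)
)
out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # (4, 512) (8, 4, 4)
```

Each of the 8 heads produces its own 4×4 attention map, while the output keeps the original 512 dimensions, so the layer can be stacked.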
Technically Interesting Points
Positional Encoding is Essential
On its own, Self-Attention treats the tokens as an unordered set: shuffling the input tokens simply shuffles the outputs in the same way.
Because word order cannot be distinguished as-is, the Transformer adds Positional Encoding (vectors representing position information), built from sin/cos functions, to the input embeddings.
While many modern LLMs use other methods like RoPE (Rotary Position Embedding, which represents relative positions via rotation), the premise that "it is difficult to incorporate order with Attention alone" remains crucial.
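As a concrete illustration, the paper's sinusoidal encoding uses $\sin(pos / 10000^{2i/d_{\text{model}}})$ for even dimensions and the matching $\cos$ for odd dimensions. The sketch below assumes an even `d_model`, uses random numbers in place of real embeddings, and omits the $\sqrt{d_{\text{model}}}$ scaling the paper applies to the embeddings; the function name is chosen for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sin/cos position vectors."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
# Random stand-in for the output of Input Embedding
embeddings = np.random.default_rng(0).normal(size=(50, 512))
x = embeddings + pe  # this sum is what enters the first Transformer layer
```

Because the encoding is simply added, two identical tokens at different positions enter the first layer as different vectors.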
Feed Forward is Independent for Each Token

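The Transformer's Feed Forward Network applies the same two-layer MLP (Multi-Layer Perceptron) to each token position.

$$
\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2
$$

The formula above may look like a convolution, but the linear transformations here do not mix neighboring positions; at most, as the paper notes, they can be viewed as convolutions with kernel size 1.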
It is easier to understand if you see the roles divided: Attention mixes information between tokens, and Feed Forward processes the representation of each token individually after it has been mixed.
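The "independent for each token" property can be checked numerically. The following sketch implements the paper's $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ with the base model's sizes ($d_{\text{model}} = 512$, inner dimension 2048); the random weights and the `0.02` scale are arbitrary choices for the demonstration, not values from the paper.

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (token) of x."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU after the first linear layer
    return hidden @ w2 + b2

d_model, d_ff, n_tokens = 512, 2048, 4
rng = np.random.default_rng(0)
w1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
x = rng.normal(size=(n_tokens, d_model))

# Processing all tokens at once gives the same result as one token at a time
batched = position_wise_ffn(x, w1, b1, w2, b2)
one_by_one = np.stack([position_wise_ffn(row, w1, b1, w2, b2) for row in x])
print(np.allclose(batched, one_by_one))  # True
```

Since no row influences any other, all cross-token mixing in a Transformer layer has to come from Attention.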
Residual Connections and Layer Normalization are Also Critical
Although the Transformer is famous for the title "Attention Is All You Need," it is not made solely of Attention.
Around each sub-layer, there are residual connections (paths that add the input to the sub-layer output) and Layer Normalization (processing that normalizes the values of the feature dimensions).
For background on residual connections, the natural reference is Deep Residual Learning for Image Recognition (ResNet), the landmark paper that popularized them.
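In the paper, each sub-layer's output is computed as LayerNorm(x + Sublayer(x)). A small NumPy sketch of that wrapper follows; the `tanh` sub-layer is only a placeholder for Attention or Feed Forward, and `gamma`, `beta`, and `eps` are the usual learnable scale, learnable shift, and a small stabilizing constant, with values assumed for this example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_block(x, sublayer, gamma, beta):
    """Arrangement described in the paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

d_model = 8  # toy size
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
w = rng.normal(size=(d_model, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

# Placeholder sub-layer; in the Transformer this is Attention or Feed Forward
y = residual_block(x, lambda t: np.tanh(t @ w), gamma, beta)
```

The residual path means the sub-layer only has to learn a correction to `x`, and the normalization keeps the scale of each token's features stable from layer to layer.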
How to Interpret the Experimental Results
In the paper, the authors evaluate the model on the WMT 2014 English-German and English-French translation tasks.
| Model | EN-DE BLEU | EN-FR BLEU | Features |
|---|---|---|---|
| ByteNet | 23.75 | - | Convolution-based sequence transformation model |
| Deep-Att + PosUnk | - | 39.2 | A strong translation model at the time |
| Transformer base | 27.3 | 38.1 | High performance with less training cost than existing models |
| Transformer big | 28.4 | 41.8 | Surpassed the previous best EN-DE results (including ensembles) by more than 2 BLEU |
Transformer big reported 28.4 BLEU for English-to-German and 41.8 BLEU for English-to-French.
However, these are results on the machine translation benchmarks of that time.
The conversational performance and reasoning capabilities of modern LLMs cannot be explained solely by these BLEU scores.
Points of Interest from an Implementer's Perspective
Place masks before softmax
In the auto-regressive generation of the Decoder, a causal mask is required to prevent the model from seeing future tokens.
In implementation, we apply a mask to the scores before softmax so that the probability of positions that should not be seen becomes nearly zero.
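```python
import torch

# attention_mask: bool tensor, True where attention is allowed (e.g. a lower-triangular causal mask)
scores = query @ key.transpose(-2, -1)
scores = scores / (head_dim**0.5)
scores = scores.masked_fill(~attention_mask, torch.finfo(scores.dtype).min)
weights = torch.softmax(scores, dim=-1)
```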
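To see the effect of the mask end to end, here is a self-contained variant in NumPy (chosen only so the example runs without PyTorch). The 4×4 random `scores` matrix stands in for real query-key scores, and the function name is hypothetical.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions before softmax; scores has shape (n, n)."""
    n = scores.shape[-1]
    visible = np.tril(np.ones((n, n), dtype=bool))  # True where key index <= query index
    masked = np.where(visible, scores, np.finfo(scores.dtype).min)
    masked = masked - masked.max(axis=-1, keepdims=True)  # stable softmax
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))
weights = causal_attention_weights(scores)
print(np.triu(weights, k=1).sum())  # 0.0
```

Every weight above the diagonal is zero, so position i only mixes Values from positions 0 through i. Masking after softmax instead would leave rows that no longer sum to 1.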
Do not over-rely on attention weights as explanation
Attention weights are convenient for observing the tendency of which tokens were strongly referenced.
On the other hand, they do not provide a complete causal explanation of the model's judgment by themselves.
It is best to treat them as clues for debugging when used for visualization.
Computational complexity increases with token count
Because Self-Attention compares all token pairs, a simple implementation requires $O(n^2)$ computation and memory for a sequence of $n$ tokens.
This challenge leads to subsequent research such as FlashAttention, Sparse Attention, KV Cache, and GQA.
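A back-of-the-envelope calculation makes the quadratic growth tangible. The helper below is a hypothetical illustration that assumes a naive implementation materializing the full score matrix for one sequence and one layer, with 8 heads and 4-byte (float32) values.

```python
def attention_score_bytes(seq_len, num_heads=8, bytes_per_value=4):
    """Memory for the full (seq_len x seq_len) score matrices of one layer."""
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (1_024, 4_096, 16_384):
    mib = attention_score_bytes(n) / 2**20
    print(f"n={n:>6}: {mib:>8.0f} MiB")
```

Quadrupling the sequence length multiplies this figure by 16, which is why the score matrix quickly dominates memory for long inputs.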
Personal Reflections
The Transformer paper does not explain all of modern LLMs.
GPT models center on the Decoder-only structure, and position information has evolved into methods like RoPE.
Nevertheless, it remains a very good entry point to understand the core flow of LLMs: "vectorize tokens, add positions, mix contexts with Attention, and process with Feed Forward."
In particular, I think the highlight is being able to view Self-Attention, Multi-Head Attention, Positional Encoding, and residual connections within the same diagram.
Detailed Version
More detailed backgrounds, diagrams, concrete examples of Q/K/V, and the differences between the Transformer and modern LLMs are summarized on my personal blog.
👉 Complete version here: What is a Transformer? An accessible explanation of the basic structure of LLMs from the Attention Is All You Need paper