LLM Series: Understanding Transformer Architecture through the Attention Paper

Introduction

This article is a technical memo based on reading the following paper:

  • Paper Title: Attention Is All You Need
  • Paper Link: https://arxiv.org/abs/1706.03762
  • Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
  • Initial Publication Date: June 12, 2017

Detailed background explanations, illustrations, and supplementary information on Self-Attention and Multi-Head Attention are summarized on my personal blog.

👉 See the full version here: What is Transformer? A gentle explanation of the basic structure of LLMs from the Attention Is All You Need paper

3-Line Summary

Figure: The Transformer Encoder and Decoder each stack Multi-Head Attention and Feed Forward sub-layers.

  • The Transformer is a model that performs sequence-to-sequence transformation centered on Attention, without using RNNs (neural networks that process sequences recurrently, one step at a time) or CNNs (neural networks that look at local patterns via convolution).
  • The paper combines Multi-Head Attention, Position-wise Feed-Forward Networks, and Positional Encoding, demonstrating high translation quality and training efficiency in machine translation.
  • GPT-style LLMs do not use the paper's architecture unchanged, but the paper makes their basic idea easy to understand: vectorize tokens, add positional information, and build contextual representations through Attention layers.

What Kind of Paper Is This?

"Attention Is All You Need" is the paper that proposed the Transformer.

The original Transformer in the paper features an Encoder-Decoder structure designed for machine translation.

The encoder reads the input sentence, and the decoder generates the translated sentence one token at a time.

| Component | Role |
| --- | --- |
| Input Embedding | Converts token IDs into vectors |
| Positional Encoding | Adds positional information to token vectors |
| Encoder Self-Attention | Looks at relationships between tokens in the input sentence |
| Decoder Masked Self-Attention | Looks at the output tokens generated so far, with future tokens masked out |
| Encoder-Decoder Attention | Allows the decoder to reference the encoder's output |
| Feed Forward Network | Performs a non-linear transformation on the representation at each position |

In the context of this LLM series, the previous entry on Bahdanau Attention focused on the idea of "where in the input the decoder looks."

In this Transformer paper, that Attention is placed at the center of sequence processing rather than acting as an auxiliary component.

What Is New?

The major breakthrough is the removal of recurrence and convolution from sequence processing, shifting to an Attention-centered structure.

Since RNNs process tokens sequentially, parallelizing computation for long sequences is difficult.

While CNNs are relatively easier to parallelize, they require stacking layers to capture relationships between distant positions.

Transformer's Self-Attention (a mechanism where tokens in the same sequence reference each other) makes it easier to directly connect distant tokens within a single layer.

| Aspect | RNN | CNN | Transformer Self-Attention |
| --- | --- | --- | --- |
| Parallelization | Weak in the time direction | Relatively easy | Very easy |
| Long-range dependency | Longer paths for distant tokens | Dependent on the number of layers | Directly referenceable in one layer |
| Positional information | Included in sequential processing | Included in convolutional positions | Explicitly added via Positional Encoding |

Understanding Self-Attention

Figure: Self-Attention flow, in which relationships between tokens in a sentence are calculated as weights.

The basic form used in the Transformer paper is Scaled Dot-Product Attention.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Query, Key, and Value are created from the vector representations of input tokens using separate weight matrices.

For example, suppose we want to update the representation of "love" in the sentence "I love machine learning."

Here, the Query created from "love" represents "what I want to reference."

On the other hand, the Keys created from each token, such as "I," "machine," "learning," and "love," serve as "clues to be matched."

Tokens with a higher affinity between their Query and Key will have higher attention weights.

Finally, the Values of each token are summed according to those weights.

It is helpful to think of the Value as "the token's vector representation transformed by a Value-specific weight matrix."

| Role | Example of updating "love" |
| --- | --- |
| Query | Created from "love." What to reference |
| Key | Created from each token. Clues to be matched with the Query |
| Value | Created from each token's representation. Content to be mixed based on weights |
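The Query/Key/Value flow described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the Q/K/V numbers are made up, and real models obtain Q, K, and V by multiplying token vectors with learned weight matrices.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy Q/K/V for the four tokens; the numbers are invented for illustration.
tokens = ["I", "love", "machine", "learning"]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
love = tokens.index("love")
# Attention weights that the token "love" assigns to every token.
print(dict(zip(tokens, (round(w, 3) for w in weights[love]))))
```

Each row of `weights` sums to 1, and the updated representation of "love" is the corresponding row of `outputs`.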

Understanding Multi-Head Attention

Figure: Multi-Head Attention flow, in which multiple heads view different relationships in parallel.

Multi-Head Attention is a mechanism that runs multiple Attention heads in parallel.

A head is a unit consisting of a separate set of projections to create Q/K/V and the Attention calculation using them.

This is loosely analogous to using multiple filters in a convolutional layer: the same input is processed from several perspectives.

However, while convolutional filters look at nearby elements locally, Attention heads can directly reference distant tokens as well.

In the base model of the paper, d_\mathrm{model}=512 and the number of heads is 8.

In other words, each token is treated as a 512-dimensional representation, and each of the 8 heads projects it down to 64 dimensions (512 / 8) for its own Attention calculation. The head outputs are then concatenated and projected back to 512 dimensions.

The initial 512-dimensional vector is created by Input Embedding, summed with the Positional Encoding, and passed to the first Transformer layer.
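For reference, the dimension bookkeeping above corresponds to the paper's definition of Multi-Head Attention, reproduced here as a sketch. The symbols h (number of heads), d_k, d_v, and the projection matrices W_i^Q, W_i^K, W_i^V, W^O follow the paper's notation and do not appear elsewhere in this article.

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
W_i^Q, W_i^K \in \mathbb{R}^{d_\mathrm{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_\mathrm{model} \times d_v}, \quad W^O \in \mathbb{R}^{h d_v \times d_\mathrm{model}}
d_k = d_v = d_\mathrm{model} / h = 512 / 8 = 64
```

Because each head works in 64 dimensions instead of 512, the total cost of the 8 heads is similar to that of a single full-dimensional head.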

Technically Interesting Points

Positional Encoding is Essential

Self-Attention by itself treats the tokens as an unordered set: shuffling the input tokens simply shuffles the outputs in the same way.

Because word order cannot be distinguished as-is, the Transformer adds Positional Encoding (vectors representing position information) using sin/cos to the input embeddings.

PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i / d_\mathrm{model}}\right)
PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i / d_\mathrm{model}}\right)
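The two formulas can be turned into a small table-building function. The following is a minimal sketch in plain Python; the function name `positional_encoding` and the parameter names `max_len` and `base` are assumptions for illustration, and real implementations usually vectorize this with a tensor library.

```python
import math


def positional_encoding(max_len, d_model, base=10000.0):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for k in range(0, d_model, 2):
            # k plays the role of 2i, so the exponent 2i / d_model is k / d_model.
            angle = pos / (base ** (k / d_model))
            row[k] = math.sin(angle)          # even dimensions use sin
            if k + 1 < d_model:
                row[k + 1] = math.cos(angle)  # odd dimensions use cos
        table.append(row)
    return table


pe = positional_encoding(max_len=4, d_model=8)
# Encoding for position 1; it is added element-wise to that token's embedding.
print([round(x, 3) for x in pe[1]])
```

Low dimensions oscillate quickly with position and high dimensions slowly, so each position receives a distinct pattern.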

While many modern LLMs use other methods like RoPE (Rotary Position Embedding, which represents relative positions via rotation), the premise that "it is difficult to incorporate order with Attention alone" remains crucial.

Feed Forward is Independent for Each Token

Figure: Feed Forward Network flow, in which the same non-linear transformation is applied at each token position.

The Transformer's Feed Forward Network applies the same two-layer MLP (Multi-Layer Perceptron) to each token position.

\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2

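The formula may look like a convolution, and the paper notes that it can also be described as two convolutions with kernel size 1. Either way, it does not mix information between neighboring positions: each token is transformed on its own.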

It is easier to understand if you see the roles divided: Attention mixes information between tokens, and Feed Forward processes the representation of each token individually after it has been mixed.
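The "same MLP at every position" idea can be sketched as follows. This is a toy illustration with hypothetical helper names and made-up weights of size 2 and 3; the paper's base model uses d_model = 512 and an inner dimension of 2048.

```python
def linear(x, W, b):
    """Compute xW + b for a vector x and a weight matrix W of shape len(x) x len(b)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, xW1 + b1)W2 + b2 for a single token vector."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)


def position_wise_ffn(sequence, W1, b1, W2, b2):
    """Apply the same FFN, with the same weights, to every position independently."""
    return [ffn(x, W1, b1, W2, b2) for x in sequence]


# Toy weights: model width 2, inner width 3 (values invented for illustration).
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [2.0, 0.0]]
print(position_wise_ffn(sequence, W1, b1, W2, b2))
```

Changing one token's vector changes only that token's output, which is what distinguishes this step from Attention.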

Residual Connections and Layer Normalization are Also Critical

Although the Transformer is famous for the title "Attention Is All You Need," it is not made solely of Attention.

Around each sub-layer, there are residual connections (paths that add the input to the sub-layer output) and Layer Normalization (processing that normalizes the values of the feature dimensions).

For background on residual connections, the natural reference is the landmark paper that popularized them, Deep Residual Learning for Image Recognition (ResNet).
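In the paper, the output of each sub-layer is LayerNorm(x + Sublayer(x)). A minimal sketch of that wrapper for a single token vector is shown below; the function names are hypothetical, and the learned gain and bias parameters of Layer Normalization are omitted for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token's features to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one token vector."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


# Stand-in sub-layer; in the Transformer this would be Attention or the Feed Forward Network.
def toy_sublayer(v):
    return [0.5 * t for t in v]


out = residual_sublayer([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print([round(v, 3) for v in out])
```

The residual path lets the input pass through unchanged even when the sub-layer contributes little, which helps when many layers are stacked.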

How to Interpret the Experimental Results

In the paper, the authors evaluate the model on the WMT 2014 English-German and English-French translation tasks.

| Model | EN-DE BLEU | EN-FR BLEU | Features |
| --- | --- | --- | --- |
| ByteNet | 23.75 | - | Convolution-based sequence transformation model |
| Deep-Att + PosUnk | - | 39.2 | A strong translation model at the time |
| Transformer base | 27.3 | 38.1 | High performance at lower training cost than existing models |
| Transformer big | 28.4 | 41.8 | Set a new state of the art on EN-DE at the time |

Transformer big reported 28.4 BLEU for English-to-German and 41.8 BLEU for English-to-French.

However, these are results on machine translation tasks of that time.

The conversational performance and reasoning capabilities of modern LLMs cannot be explained solely by these BLEU scores.

Points of Interest from an Implementer's Perspective

Place masks before softmax

In the auto-regressive generation of the Decoder, a causal mask is required to prevent the model from seeing future tokens.

In implementation, we apply a mask to the scores before softmax so that the probability of positions that should not be seen becomes nearly zero.

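import torch

# query, key: (..., seq_len, head_dim); attention_mask: bool tensor, True = may attend
scores = query @ key.transpose(-2, -1)
scores = scores / (head_dim**0.5)
# Fill disallowed positions with the dtype's minimum so softmax gives them ~0 weight
scores = scores.masked_fill(~attention_mask, torch.finfo(scores.dtype).min)
weights = torch.softmax(scores, dim=-1)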
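To see what the mask does to the weights, here is a dependency-free sketch of the same idea. The helper names `causal_mask` and `masked_softmax` are hypothetical, and this version uses negative infinity instead of the dtype minimum, which is safe here only because every row keeps at least one allowed position (its own).

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax where disallowed entries are set to -inf before normalizing."""
    result = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Uniform scores make the effect of the mask easy to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

Each row spreads its weight only over the current and earlier positions. Applying the mask after softmax instead would leave rows that no longer sum to 1.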

Do not over-rely on attention weights as explanation

Attention weights are convenient for observing the tendency of which tokens were strongly referenced.

On the other hand, they do not provide a complete causal explanation of the model's judgment by themselves.

It is best to treat them as clues for debugging when used for visualization.

Computational complexity increases with token count

Because Self-Attention compares all token pairs, a simple implementation requires O(n^2) computation and memory for n tokens.

This cost motivated subsequent work on making Attention cheaper in computation or memory, such as FlashAttention, Sparse Attention, KV Cache, and GQA (Grouped-Query Attention).
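A rough back-of-the-envelope calculation shows how quickly the quadratic term grows. The function below is a hypothetical helper that assumes a naive implementation storing one 4-byte score per token pair per head, with 8 heads as in the paper's base model; actual memory use depends on the implementation.

```python
def attention_score_bytes(n_tokens, n_heads=8, bytes_per_value=4):
    """Bytes needed to hold one layer's n x n attention scores for every head."""
    return n_heads * n_tokens * n_tokens * bytes_per_value


for n in (1_000, 2_000, 4_000):
    mib = attention_score_bytes(n) / 2**20
    print(f"{n} tokens: {mib:.1f} MiB per layer")
```

Doubling the number of tokens quadruples the size of the score matrix.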

Personal Reflections

The Transformer paper does not explain all of modern LLMs.

GPT models center on the Decoder-only structure, and position information has evolved into methods like RoPE.

Nevertheless, it remains a very good entry point to understand the core flow of LLMs: "vectorize tokens, add positions, mix contexts with Attention, and process with Feed Forward."

In particular, I think the highlight is being able to view Self-Attention, Multi-Head Attention, Positional Encoding, and residual connections within the same diagram.

Detailed Version

More detailed backgrounds, diagrams, concrete examples of Q/K/V, and the differences between the Transformer and modern LLMs are summarized on my personal blog.

👉 Complete version here: What is a Transformer? An accessible explanation of the basic structure of LLMs from the Attention Is All You Need paper
