
Exploring Transformer (3) — GPT-2-like


Purpose

I believe OpenAI's very famous research includes Improving Language Understanding by Generative Pre-Training and Language Models are Unsupervised Multitask Learners. These are what we call GPT(-1) and GPT-2. The reference implementation for GPT-2 can be found at gpt-2.

Additionally, there is a neural kana-kanji conversion engine called "Zenzai" that applies this GPT-2 to achieve something similar to macOS's "Live Conversion." (see Integrating the neural kana-kanji conversion engine "Zenzai" into azooKey on macOS)

Now, this is a continuation of Exploring Transformer (1) and Exploring Transformer (2) — Pseudo-ViT. Since "Zenzai" runs locally, I judged it to be a reasonably sized model and decided to boldly take a crack at GPT-2.

Vibe coding again

Once again, I'm asking GPT-4.1 to create training materials, as the MiniFormer from my previous post Exploring Transformer (1) should be reusable. Therefore, I provided the source code from that post as context to have it generated.

The reason for this decision is that section 3.1, Unsupervised pre-training, of Improving Language Understanding by Generative Pre-Training states:

In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:

\begin{align*}h_0 &= U W_e + W_p \\ h_l &= \operatorname{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\ P(u) &= \operatorname{softmax}(h_n W_e^T)\end{align*}\tag{2}

where U = (u_{-k}, \ldots, u_{-1}) is the context vector of tokens, n is the number of layers, W_e is the token embedding matrix, and W_p is the position embedding matrix.
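Equation (2) can be sketched directly in PyTorch. The sizes below are made up for illustration, and the transformer blocks are stand-in identities; the point is the shape of the computation, not a real model:

```python
import torch

# Hypothetical sizes for illustration only
k, vocab, d_model, n_layers = 8, 100, 32, 2

W_e = torch.randn(vocab, d_model)   # token embedding matrix
W_p = torch.randn(k, d_model)       # position embedding matrix
U = torch.randint(0, vocab, (k,))   # context token ids

# h_0 = U W_e + W_p  (indexing into W_e is the same as one-hot @ W_e)
h = W_e[U] + W_p

# h_l = transformer_block(h_{l-1})  -- placeholder blocks here
def transformer_block(h):
    return h  # stands in for attention + FFN

for _ in range(n_layers):
    h = transformer_block(h)

# P(u) = softmax(h_n W_e^T)  -- note the reuse (tying) of W_e
P = torch.softmax(h @ W_e.t(), dim=-1)
print(P.shape)  # one distribution over the vocabulary per position
```

The final line is where the token embedding matrix W_e is reused for the output projection, which is exactly the weight tying that reappears in the implementation below.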

So I thought using a Transformer decoder would work. Furthermore, reading the GPT-2 paper Language Models are Unsupervised Multitask Learners, section 2.3 Model says:

We use a Transformer (Vaswani et al., 2017) based architecture for our LMs. The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016), and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/\sqrt{N} where N is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batch size of 512 is used.

Based on this, I felt that I just needed to change the position of layer normalization from GPT-1 and make other adjustments.
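The two LayerNorm placements can be contrasted in a few lines. This is a sketch with a plain linear layer standing in for the attention/FFN sublayer, plus the 1/sqrt(N) residual-weight scaling the GPT-2 paper describes (N here is a hypothetical layer count):

```python
import math
import torch
import torch.nn as nn

d_model = 32
ln = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model, bias=False)  # stand-in for attention or FFN

x = torch.randn(4, d_model)

# GPT-1 style (Post-LN): sublayer, residual add, then LayerNorm
post_ln_out = ln(x + sublayer(x))

# GPT-2 style (Pre-LN): LayerNorm first, residual add on the untouched path
pre_ln_out = x + sublayer(ln(x))

# GPT-2's modified initialization: scale residual-layer weights by 1/sqrt(N),
# where N is the number of residual layers (hypothetical value here)
N = 12
with torch.no_grad():
    sublayer.weight.mul_(1.0 / math.sqrt(N))

print(post_ln_out.shape, pre_ln_out.shape)
```

In the Pre-LN form the residual path carries x through unnormalized, which is what makes deep stacks easier to train; MiniGPT2 below uses this arrangement.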

Taking these points into account, I asked GPT-4.1 to build it, then had Gemini 2.5 Flash verify the result, compare it with the reference implementation gpt-2, and explain the differences to me.

What I made

  • Task content: Given the first few characters as a prompt, it fabricates and returns words that look like the names of animals, fruits, or vegetables.
  • The base Transformer implementation is MiniFormer from Exploring Transformer (1). Therefore, once again prioritizing simplicity, the attention uses a single head and only one layer.
  • Using MiniGPT2, which is a significantly simplified version of the architecture from Language Models are Unsupervised Multitask Learners.
  • For the dataset, I used several hundred words consisting of animal, fruit, and vegetable names prepared in Exploring Transformer (1). Since there aren't strong trends or a massive amount of data, high generalization performance during model training cannot be strongly expected.

Implementation

Basically, everything except MiniGPT2 is the same as in Exploring Transformer (1); it is still mostly MiniFormer. The changes are the parts marked with "★" in "# 5. MiniGPT2 main body (Pre-LN)" and the addition of the generate method. The "★" changes are explained in more detail later.

Setup and Dataset

!pip install -qU bertviz
from __future__ import annotations

import json
import random
import string
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np


SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
ANIMALS = [
  "cat",
  "caracal",
  "capybara",
  "canary",
  "cavy",
  "caiman",
  "cacomistle",
  "caribou",
  "cassowary",
  "caterpillar",
  "dog",
  ...
]  # Omitted as it is long

FRUITS_VEGGIES = [
  "apple",
  "apricot",
  "avocado",
  "artichoke",
  "banana",
  "bilberry",
  "blackberry",
  "blueberry",
  "boysenberry",
  "breadfruit",
  "cantaloupe",
  "casaba",
  ...
]  # Omitted as it is long


# 1. For dataset: Approximately 1000 types in total, including animal names + fruit/vegetable names
NAMES = ANIMALS + FRUITS_VEGGIES

# 2. Creating character vocabulary
ALL_CHARS = sorted(set("".join(NAMES)))
SPECIAL_TOKENS = ["<PAD>", "<BOS>", "<EOS>"]
ALL_TOKENS = SPECIAL_TOKENS + ALL_CHARS
VOCAB_SIZE = len(ALL_TOKENS)
CHAR2IDX = {ch: i for i, ch in enumerate(ALL_TOKENS)}
IDX2CHAR = {i: ch for ch, i in CHAR2IDX.items()}
PAD_IDX = CHAR2IDX["<PAD>"]
BOS_IDX = CHAR2IDX["<BOS>"]
EOS_IDX = CHAR2IDX["<EOS>"]
def encode_word(word, max_len):
    tokens = [BOS_IDX] + [CHAR2IDX[c] for c in word] + [EOS_IDX]
    tokens += [PAD_IDX] * (max_len - len(tokens))
    return tokens

def decode_tokens(tokens):
    chars = []
    for idx in tokens:
        if idx == EOS_IDX:
            break
        if idx >= len(IDX2CHAR):
            continue
        ch = IDX2CHAR[idx]
        if ch not in SPECIAL_TOKENS:
            chars.append(ch)
    return "".join(chars)

# 3. PyTorch Dataset
class NameDataset(Dataset):
    def __init__(self, words, max_len):
        self.max_len = max_len
        self.data = []
        for w in words:
            tokens = encode_word(w, max_len)
            self.data.append(tokens)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        tokens = self.data[idx]
        x = torch.tensor(tokens[:-1], dtype=torch.long)
        y = torch.tensor(tokens[1:], dtype=torch.long)
        return x, y
# 4. Simple positional encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        if d_model > 1:
            pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)
    def forward(self, x):
        # Add the positional encodings for the first x.size(1) positions
        return x + self.pe[:x.size(1)]

# 5. MiniGPT2 main body (Pre-LN)
class MiniGPT2(nn.Module):
    def __init__(self, vocab_size, d_model=32, max_len=16):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model, max_len)
        # Self-Attention (single head, 1 layer)
        # ★ No bias terms. Detailed reasons will be described later.
        self.q_linear = nn.Linear(d_model, d_model, bias=False)
        self.k_linear = nn.Linear(d_model, d_model, bias=False)
        self.v_linear = nn.Linear(d_model, d_model, bias=False)
        self.attn_out = nn.Linear(d_model, d_model, bias=False)
        # FFN
        # ★ No bias terms. Detailed reasons will be described later.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model, bias=False),
            nn.ReLU(),
            nn.Linear(d_model, d_model, bias=False)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.max_len = max_len
        self.attn_weights = None

    def forward(self, x, return_attn=False):
        emb = self.embed(x)
        emb = self.pos_enc(emb)
        # ★ Pre-LN structure: LN -> Attention -> Add
        attn_in = self.ln1(emb)
        Q = self.q_linear(attn_in)
        K = self.k_linear(attn_in)
        V = self.v_linear(attn_in)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(self.d_model)
        # causal mask
        mask = torch.triu(torch.ones(scores.size(-2), scores.size(-1)), diagonal=1).bool().to(x.device)
        scores = scores.masked_fill(mask, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        attn_out = torch.matmul(attn, V)
        attn_out = self.attn_out(attn_out)
        # ★ self.ln1 has moved into the Pre-LN structure
        x1 = emb + attn_out
        # ★ Pre-LN -> FFN -> Add
        ffn_in = self.ln2(x1)
        x2 = x1 + self.ffn(ffn_in)
        # ★ weight tying
        logits = torch.matmul(x2, self.embed.weight.t())
        if return_attn:
            self.attn_weights = attn.detach().cpu().numpy()
            return logits, attn
        return logits

    def generate(self, start_tokens, eos_idx, pad_idx, max_gen=None, temperature=1.0):
        self.eval()
        max_gen = max_gen or self.max_len
        tokens = start_tokens.tolist()
        for _ in range(max_gen - len(tokens)):
            inp = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).to(next(self.parameters()).device)
            logits = self.forward(inp)
            next_token_logits = logits[0, len(tokens)-1] / temperature
            probs = torch.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1).item()
            if next_token == eos_idx:
                break
            tokens.append(next_token)
        while len(tokens) < self.max_len:
            tokens.append(pad_idx)
        return tokens
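The causal mask built inside forward can be inspected in isolation. With all-zero scores, each row of the attention matrix becomes uniform over the visible prefix and exactly zero beyond it:

```python
import torch

# The mask used in MiniGPT2.forward: the strict upper triangle is set to
# -inf so position t can only attend to positions <= t.
T = 5
scores = torch.zeros(T, T)
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))
attn = torch.softmax(scores, dim=-1)
print(attn)
# Row t is uniform over the first t+1 positions and 0 afterwards.
```

Because softmax of -inf is exactly 0, no probability mass ever leaks to future tokens, which is what makes the model autoregressive.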

Training Loop

# 6. Training loop
def train(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for x, y in loader:
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits.view(-1, VOCAB_SIZE), y.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device)
            y = y.to(device)
            logits = model(x)
            loss = criterion(logits.view(-1, VOCAB_SIZE), y.view(-1))
            total_loss += loss.item()
    return total_loss / len(loader)

Training

%%time

# Settings
max_word_len = max(len(w) for w in NAMES) + 2 # BOS, EOS
batch_size = 16
d_model = 32
n_epochs = 100
device = "cuda" if torch.cuda.is_available() else "cpu"

# Data splitting
random.seed(42)
random.shuffle(NAMES)
split = int(len(NAMES) * 0.8)
train_words = NAMES[:split]
test_words = NAMES[split:]

train_ds = NameDataset(train_words, max_word_len)
test_ds = NameDataset(test_words, max_word_len)
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)

# Model
model = MiniGPT2(VOCAB_SIZE, d_model, max_word_len).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Training
for epoch in range(1, n_epochs+1):
    train_loss = train(model, train_loader, optimizer, criterion, device)
    test_loss = evaluate(model, test_loader, criterion, device)
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d}: train loss={train_loss:.4f}, test loss={test_loss:.4f}")

Epoch 5: train loss=3.2279, test loss=3.1276
...
Epoch 100: train loss=2.2263, test loss=2.4894
CPU times: user 18.7 s, sys: 710 ms, total: 19.4 s
Wall time: 21 s

Generation Examples

# 7. Word generation
def sample_generate(prompt, model, max_word_len, temperature=1.0):
    start_tokens = [BOS_IDX] + [CHAR2IDX[c] for c in prompt]
    start_tokens = torch.tensor(start_tokens, dtype=torch.long)
    out_tokens = model.generate(start_tokens, eos_idx=EOS_IDX, pad_idx=PAD_IDX, max_gen=max_word_len, temperature=temperature)
    return decode_tokens(out_tokens[1:])  # Exclude BOS

# Example: Generate something animal-like starting with "ca"
print("Generation example:", sample_generate("ca", model, max_word_len, temperature=0.8))

Generation example: categow

print("Generation example:", sample_generate("do", model, max_word_len, temperature=0.8))

Generation example: dock

print("Generation example:", sample_generate("app", model, max_word_len, temperature=0.8))

Generation example: appot

print("Generation example:", sample_generate("sh", model, max_word_len, temperature=0.8))

Generation example: shanelon

The results turned out like this. I retrained the model several times while tweaking it, and other runs produced examples such as catelow, docket, appine, shoon, categothar, dotes, appinch, shabeer, categot, docketep, appincho, and she. Some are a bit questionable, but they do give off the air of being the name of something.
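The temperature=0.8 used above can be understood from how generate divides the logits before softmax: values below 1 sharpen the distribution toward the top token, values above 1 flatten it. A quick check with made-up logits:

```python
import torch

# Temperature in generate(): logits are divided by the temperature before
# softmax, so T < 1 sharpens the distribution and T > 1 flattens it.
logits = torch.tensor([2.0, 1.0, 0.0])

sharp = torch.softmax(logits / 0.5, dim=-1)  # closer to greedy decoding
plain = torch.softmax(logits / 1.0, dim=-1)
flat = torch.softmax(logits / 2.0, dim=-1)   # more diverse samples

print(sharp[0].item(), plain[0].item(), flat[0].item())
```

The probability of the top token shrinks as the temperature grows, which is why 0.8 gives names that mostly stay plausible while still varying between runs.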

Explanation by GPT-4.1 and Gemini 2.5 Flash

I will provide some excerpts. I had them compare it with the reference implementation gpt-2, specifically model.py. They pointed out things I would have never noticed on my own.

Main Alignment Points

  1. Basic Structure (Decoder-type Transformer, Autoregressive)
  • Both the official implementation and your implementation follow the flow: "Embedding → Positional Encoding → Multiple Blocks (Attention + FFN) → Output Layer."
  2. LayerNorm Position (Pre-LN)
  • In the official implementation's block() function, LayerNorm (norm(x, ...)) is applied before Attention and MLP (FFN) (Pre-LN).
  • Your MiniGPT2 has also been corrected to use Pre-LN.
  3. Residual Connections
  • Official: x = x + a (Attention output), x = x + m (MLP output).
  • MiniGPT2 is the same (x1 = emb + attn_out, x2 = x1 + ffn(...)).
  4. Causal Mask
  • Both the official and PyTorch implementations use triu for the upper triangular mask to ensure autoregression.
  5. Output Layer
  • Official implementation (openai/gpt-2)

    • After the final LayerNorm, the input embedding matrix wte (equivalent to self.embed.weight) is transposed and used for the output linear transformation (weight tying, weight sharing).
    • This reduces the parameter count and stabilizes training.
  • MiniGPT2 (This implementation)

    • After the final LayerNorm and FFN, logits are calculated with torch.matmul(x2, self.embed.weight.t()), sharing the embedding weights and output weights (weight tying).
    • Like the official implementation, weight tying has been correctly introduced, leading to a structure and behavior that is more "GPT-2-like."
  6. Disabling Bias in Q, K, V, attn_out, and FFN

Official implementation (from openai/gpt-2/src/model.py)

  • All linear layers for Q, K, V, attn_out, and FFN (MLP) are designed without bias terms.

    • In the official implementation these linear transformations are implemented with conv1d, and bias terms are in fact defined via b = tf.get_variable('b', [nf], initializer=tf.constant_initializer(0)); even so, disabling bias terms has become the de facto standard in recent PyTorch/TensorFlow implementations because of the interaction with weight tying and LayerNorm.
    • OpenAI's PyTorch implementations (e.g., in the transformers library) are also unified with bias=False.
  • Specifically, bias is disabled (bias=False) in the following parts:

    • Q, K, V: Split of c_attn
    • Attention output: c_proj
    • FFN layers: c_fc, c_proj

Our MiniGPT2 implementation (PyTorch)

  • bias=False is explicitly set in each linear layer for Q, K, V, attn_out, and FFN (MLP).
  • This completely aligns with the design philosophy of the official GPT-2 implementation (no bias in Q, K, V, attn_out, or FFN).

Supplementary Notes

  • Biases exist in LayerNorm and Embedding, but this is also the case in the official implementation.
  • The main reasons for disabling bias include: "not needed in combination with LayerNorm," "reduction in parameter count," and "stabilizing training behavior."
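The weight tying discussed in point 5 is simple to see in isolation. This sketch (with made-up sizes) shows the output projection reusing the embedding matrix, and the parameter saving compared to an untied head:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30, 32
embed = nn.Embedding(vocab_size, d_model)

# Weight tying: reuse the embedding matrix (transposed) as the output
# projection instead of allocating a separate vocab_size x d_model head.
h = torch.randn(4, d_model)        # final hidden states for 4 positions
logits = h @ embed.weight.t()      # shape (4, vocab_size)

# An untied head would double the vocabulary-related parameter count:
tied_params = embed.weight.numel()
untied_params = 2 * embed.weight.numel()
print(logits.shape, tied_params, untied_params)
```

This is exactly the torch.matmul(x2, self.embed.weight.t()) line in MiniGPT2: no separate output layer is ever created, so there is nothing to keep in sync.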

Future Outlook and Application Examples

  • For other applications, according to GPT-4.1, it might be possible to repurpose the model for typo correction tasks by "using a word with an intentional one-character error as a prompt to see if the correct word can be generated." In essence, this would be a form of anomaly detection using a type of Autoencoder. It could be evaluated using the match rate between the correct answer and the model output, or the Levenshtein distance (edit distance).
  • While it might be a bit difficult given the simplicity of the model, another possibility is to replace the dataset with one consisting of character names to use it as an automatic character naming tool.
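For the typo-correction evaluation idea above, the Levenshtein distance is short enough to write by hand. A minimal sketch (the example words are taken from the generation results earlier):

```python
# Minimal Levenshtein (edit) distance: the number of single-character
# insertions, deletions, and substitutions needed to turn a into b.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# How far is the generated "appot" from the real word "apricot"?
print(levenshtein("appot", "apricot"))  # → 3
```

A model output could then be scored against the intended word, with distance 0 counting as an exact match.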

Summary

With this, I feel like I've finally gotten a bit of a feel for what GPT-2 is all about.

Looking back, this series of articles is what I would have wanted to read if there had been an "Extended Edition" of Deep Learning from Scratch 2. When I was reading "Deep Learning from Scratch 2," generative AI hadn't developed to this extent, and I felt that I couldn't properly get my hands on things like Transformers. However, in the current landscape, I've become able to receive guidance with the help of AI.

I wonder what I should do next.
