Translated by AI
Exploring Transformer (3) — GPT-2-like
Purpose
Among OpenAI's best-known research papers are Improving Language Understanding by Generative Pre-Training and Language Models are Unsupervised Multitask Learners, the papers we call GPT(-1) and GPT-2. The reference implementation for GPT-2 can be found at gpt-2.
Additionally, there is a neural kana-kanji conversion engine called "Zenzai" that applies this GPT-2 to achieve something similar to macOS's "Live Conversion." (see Integrating the neural kana-kanji conversion engine "Zenzai" into azooKey on macOS)
Now, this is a continuation of Exploring Transformer (1) and Exploring Transformer (2) — Pseudo-ViT. Since "Zenzai" runs locally, I judged it to be a reasonably sized model and decided to boldly take a crack at GPT-2.
Vibe coding again
Once again, I asked GPT-4.1 to create the teaching material. The MiniFormer from my previous post Exploring Transformer (1) should be reusable, so I provided the source code from that post as context for the generation.
The reason for this decision is that section 3.1 Unsupervised pre-training of Improving Language Understanding by Generative Pre-Training states:
In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:
\begin{align*}h_0 &= U W_e + W_p \\ h_l &= \operatorname{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\ P(u) &= \operatorname{softmax}(h_n W_e^T)\end{align*}\tag{2}
where U = (u_{-k}, \ldots, u_{-1}) is the context vector of tokens, n is the number of layers, W_e is the token embedding matrix, and W_p is the position embedding matrix.
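As a sanity check, the quoted equations can be traced with toy tensors. This is only an illustrative sketch: the dimensions are made up, and the stack of transformer_block layers between h_0 and h_n is elided.

```python
import numpy as np

# Toy dimensions, purely for illustration (not the paper's configuration).
k, vocab, d_model = 4, 10, 8          # context length, vocab size, width
rng = np.random.default_rng(0)

U = np.eye(vocab)[rng.integers(0, vocab, size=k)]   # one-hot context tokens, shape (k, vocab)
W_e = rng.normal(size=(vocab, d_model))             # token embedding matrix
W_p = rng.normal(size=(k, d_model))                 # position embedding matrix

h0 = U @ W_e + W_p                                  # h_0 = U W_e + W_p
# ... h_l = transformer_block(h_{l-1}) for l in 1..n would go here ...
logits = h0 @ W_e.T                                 # reuse of W_e: h_n W_e^T
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax

assert probs.shape == (k, vocab)
assert np.allclose(probs.sum(-1), 1.0)
```

Note how the same W_e appears both in the embedding step and (transposed) in the output projection — this is the weight tying that comes up again later.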
So I thought using a Transformer decoder would work. Furthermore, reading the GPT-2 paper Language Models are Unsupervised Multitask Learners, section 2.3 Model says:
We use a Transformer (Vaswani et al., 2017) based architecture for our LMs. The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/\sqrt{N} where N is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batch size of 512 is used.
Based on this, I felt that I just needed to change the position of layer normalization from GPT-1 and make other adjustments.
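The layer normalization move can be illustrated in a few lines. This is a hedged sketch, where sublayer stands in for either the attention block or the FFN:

```python
import torch
import torch.nn as nn

# Post-LN (GPT-1 style) applies LayerNorm after the residual add;
# Pre-LN (GPT-2 style) applies it to the sub-block's input instead.
# `sublayer` is a stand-in for attention or the FFN.
d_model = 8
ln = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)

x = torch.randn(2, 5, d_model)

post_ln = ln(x + sublayer(x))   # GPT-1 style: sublayer -> Add -> LN
pre_ln = x + sublayer(ln(x))    # GPT-2 style: LN -> sublayer -> Add

assert post_ln.shape == pre_ln.shape == x.shape
```

The residual path in the Pre-LN variant carries x through untouched by any normalization, which is the property usually credited with stabilizing training in deeper stacks.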
Taking these points into account, I asked GPT-4.1 to build it, then had Gemini 2.5 Flash verify the result against the reference implementation gpt-2 and explain the differences to me.
What I made
- Task content: given the first few characters as a prompt, the model fabricates and returns words that look like names of animals, fruits, or vegetables.
- The base Transformer implementation is MiniFormer from Exploring Transformer (1). Therefore, once again prioritizing simplicity, attention uses a single head and only one layer.
- The model is MiniGPT2, a significantly simplified version of the architecture from Language Models are Unsupervised Multitask Learners.
- For the dataset, I used the several hundred animal, fruit, and vegetable names prepared in Exploring Transformer (1). Since the data is neither large nor strongly patterned, high generalization performance cannot be expected from training.
Implementation
Basically, everything except MiniGPT2 is the same as in Exploring Transformer (1); it is mostly still MiniFormer. The changes are the parts marked with "★" in "# 5. MiniGPT2 main body (Pre-LN)" and the addition of the generate method. The "★" changes are explained in more detail later.
Setup and Dataset
!pip install -qU bertviz
from __future__ import annotations
import json
import random
import string
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
ANIMALS = [
"cat",
"caracal",
"capybara",
"canary",
"cavy",
"caiman",
"cacomistle",
"caribou",
"cassowary",
"caterpillar",
"dog",
...
] # Omitted as it is long
FRUITS_VEGGIES = [
"apple",
"apricot",
"avocado",
"artichoke",
"banana",
"bilberry",
"blackberry",
"blueberry",
"boysenberry",
"breadfruit",
"cantaloupe",
"casaba",
...
] # Omitted as it is long
# 1. For dataset: Approximately 1000 types in total, including animal names + fruit/vegetable names
NAMES = ANIMALS + FRUITS_VEGGIES
# 2. Creating character vocabulary
ALL_CHARS = sorted(set("".join(NAMES)))
SPECIAL_TOKENS = ["<PAD>", "<BOS>", "<EOS>"]
ALL_TOKENS = SPECIAL_TOKENS + ALL_CHARS
VOCAB_SIZE = len(ALL_TOKENS)
CHAR2IDX = {ch: i for i, ch in enumerate(ALL_TOKENS)}
IDX2CHAR = {i: ch for ch, i in CHAR2IDX.items()}
PAD_IDX = CHAR2IDX["<PAD>"]
BOS_IDX = CHAR2IDX["<BOS>"]
EOS_IDX = CHAR2IDX["<EOS>"]
def encode_word(word, max_len):
    tokens = [BOS_IDX] + [CHAR2IDX[c] for c in word] + [EOS_IDX]
    tokens += [PAD_IDX] * (max_len - len(tokens))
    return tokens

def decode_tokens(tokens):
    chars = []
    for idx in tokens:
        if idx == EOS_IDX:
            break
        if idx >= len(IDX2CHAR):
            continue
        ch = IDX2CHAR[idx]
        if ch not in SPECIAL_TOKENS:
            chars.append(ch)
    return "".join(chars)
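As a quick illustration of this encoding scheme, here is a self-contained round trip with a stand-in mini vocabulary (the names and the tiny character set are made up for the example, mirroring the definitions above):

```python
# Stand-in vocabulary mirroring the scheme above: specials first, then chars.
SPECIALS = ["<PAD>", "<BOS>", "<EOS>"]
CHARS = sorted(set("catdog"))
TOKENS = SPECIALS + CHARS
C2I = {t: i for i, t in enumerate(TOKENS)}
I2C = {i: t for t, i in C2I.items()}
PAD, BOS, EOS = C2I["<PAD>"], C2I["<BOS>"], C2I["<EOS>"]

def encode(word, max_len):
    # <BOS> word <EOS>, right-padded with <PAD> to max_len
    ids = [BOS] + [C2I[c] for c in word] + [EOS]
    return ids + [PAD] * (max_len - len(ids))

def decode(ids):
    # Stop at <EOS>, skip special tokens
    out = []
    for i in ids:
        if i == EOS:
            break
        if I2C[i] not in SPECIALS:
            out.append(I2C[i])
    return "".join(out)

assert decode(encode("cat", 8)) == "cat"   # round trip is lossless
```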
# 3. PyTorch Dataset
class NameDataset(Dataset):
    def __init__(self, words, max_len):
        self.max_len = max_len
        self.data = []
        for w in words:
            tokens = encode_word(w, max_len)
            self.data.append(tokens)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        tokens = self.data[idx]
        x = torch.tensor(tokens[:-1], dtype=torch.long)
        y = torch.tensor(tokens[1:], dtype=torch.long)
        return x, y
# 4. Simple positional encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        if d_model > 1:
            pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # pe[:seq_len] has shape (seq_len, d_model) and broadcasts over the batch
        return x + self.pe[:x.size(1)]
# 5. MiniGPT2 main body (Pre-LN)
class MiniGPT2(nn.Module):
    def __init__(self, vocab_size, d_model=32, max_len=16):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model, max_len)
        # Self-Attention (single head, 1 layer)
        # ★ No bias terms. Detailed reasons will be described later.
        self.q_linear = nn.Linear(d_model, d_model, bias=False)
        self.k_linear = nn.Linear(d_model, d_model, bias=False)
        self.v_linear = nn.Linear(d_model, d_model, bias=False)
        self.attn_out = nn.Linear(d_model, d_model, bias=False)
        # FFN
        # ★ No bias terms. Detailed reasons will be described later.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model, bias=False),
            nn.ReLU(),
            nn.Linear(d_model, d_model, bias=False)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.max_len = max_len
        self.attn_weights = None

    def forward(self, x, return_attn=False):
        emb = self.embed(x)
        emb = self.pos_enc(emb)
        # ★ Pre-LN structure: LN -> Attention -> Add
        attn_in = self.ln1(emb)
        Q = self.q_linear(attn_in)
        K = self.k_linear(attn_in)
        V = self.v_linear(attn_in)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(self.d_model)
        # causal mask: each position may attend only to itself and earlier tokens
        mask = torch.triu(torch.ones(scores.size(-2), scores.size(-1), device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        attn_out = torch.matmul(attn, V)
        attn_out = self.attn_out(attn_out)
        # ★ self.ln1 has moved into the Pre-LN structure
        x1 = emb + attn_out
        # ★ Pre-LN -> FFN -> Add
        ffn_in = self.ln2(x1)
        x2 = x1 + self.ffn(ffn_in)
        # ★ weight tying: reuse the embedding matrix as the output projection
        logits = torch.matmul(x2, self.embed.weight.t())
        if return_attn:
            self.attn_weights = attn.detach().cpu().numpy()
            return logits, attn
        return logits

    @torch.no_grad()  # inference only: no autograd graph needed during generation
    def generate(self, start_tokens, eos_idx, pad_idx, max_gen=None, temperature=1.0):
        self.eval()
        max_gen = max_gen or self.max_len
        tokens = start_tokens.tolist()
        for _ in range(max_gen - len(tokens)):
            inp = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).to(next(self.parameters()).device)
            logits = self.forward(inp)
            next_token_logits = logits[0, len(tokens) - 1] / temperature
            probs = torch.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1).item()
            if next_token == eos_idx:
                break
            tokens.append(next_token)
        while len(tokens) < self.max_len:
            tokens.append(pad_idx)
        return tokens
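The temperature parameter in generate divides the logits before the softmax. Its effect can be checked in isolation with a standalone toy example (the logit values here are arbitrary):

```python
import torch

# Dividing logits by T < 1 sharpens the distribution; T > 1 flattens it.
logits = torch.tensor([2.0, 1.0, 0.0])

p_sharp = torch.softmax(logits / 0.5, dim=-1)   # low temperature
p_plain = torch.softmax(logits / 1.0, dim=-1)   # unchanged
p_flat = torch.softmax(logits / 2.0, dim=-1)    # high temperature

# The most likely token gains probability mass as temperature drops.
assert p_sharp[0] > p_plain[0] > p_flat[0]
```

This is why the generation examples later use temperature=0.8: slightly below 1, the model still varies its outputs but leans more toward high-probability continuations.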
Training Loop
# 6. Training loop
def train(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for x, y in loader:
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits.view(-1, VOCAB_SIZE), y.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device)
            y = y.to(device)
            logits = model(x)
            loss = criterion(logits.view(-1, VOCAB_SIZE), y.view(-1))
            total_loss += loss.item()
    return total_loss / len(loader)
Training
%%time
# Settings
max_word_len = max(len(w) for w in NAMES) + 2 # BOS, EOS
batch_size = 16
d_model = 32
n_epochs = 100
device = "cuda" if torch.cuda.is_available() else "cpu"
# Data splitting
random.seed(42)
random.shuffle(NAMES)
split = int(len(NAMES) * 0.8)
train_words = NAMES[:split]
test_words = NAMES[split:]
train_ds = NameDataset(train_words, max_word_len)
test_ds = NameDataset(test_words, max_word_len)
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
# Model
model = MiniGPT2(VOCAB_SIZE, d_model, max_word_len).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
# Training
for epoch in range(1, n_epochs + 1):
    train_loss = train(model, train_loader, optimizer, criterion, device)
    test_loss = evaluate(model, test_loader, criterion, device)
    if epoch % 5 == 0:
        print(f"Epoch {epoch:2d}: train loss={train_loss:.4f}, test loss={test_loss:.4f}")
Epoch 5: train loss=3.2279, test loss=3.1276
...
Epoch 100: train loss=2.2263, test loss=2.4894
CPU times: user 18.7 s, sys: 710 ms, total: 19.4 s
Wall time: 21 s
Generation Examples
# 7. Word generation
def sample_generate(prompt, model, max_word_len, temperature=1.0):
    start_tokens = [BOS_IDX] + [CHAR2IDX[c] for c in prompt]
    start_tokens = torch.tensor(start_tokens, dtype=torch.long)
    out_tokens = model.generate(start_tokens, eos_idx=EOS_IDX, pad_idx=PAD_IDX, max_gen=max_word_len, temperature=temperature)
    return decode_tokens(out_tokens[1:])  # Exclude BOS
# Example: Generate something animal-like starting with "ca"
print("Generation example:", sample_generate("ca", model, max_word_len, temperature=0.8))
Generation example: categow
print("Generation example:", sample_generate("do", model, max_word_len, temperature=0.8))
Generation example: dock
print("Generation example:", sample_generate("app", model, max_word_len, temperature=0.8))
Generation example: appot
print("Generation example:", sample_generate("sh", model, max_word_len, temperature=0.8))
Generation example: shanelon
The results turned out like this. I tuned and retrained the model several times, and other examples obtained along the way included catelow, docket, appine, shoon, categothar, dotes, appinch, shabeer, categot, docketep, appincho, and she. Some are a bit questionable, but many do carry the feel of being "some kind of name."
Explanation by GPT-4.1 and Gemini 2.5 Flash
I will provide some excerpts. I had them compare it with the reference implementation gpt-2, specifically model.py. They pointed out things I would have never noticed on my own.
Main Alignment Points
- Basic Structure (Decoder-type Transformer, Autoregressive)
- Both the official implementation and your implementation follow the flow: "Embedding → Positional Encoding → Multiple Blocks (Attention + FFN) → Output Layer."
- LayerNorm Position (Pre-LN)
- In the official implementation's block() function, LayerNorm (norm(x, ...)) is applied before Attention and MLP (FFN) (Pre-LN).
- Your MiniGPT2 has also been corrected to use Pre-LN.
- Residual Connections
- Official: x = x + a (Attention output), x = x + m (MLP output).
- MiniGPT2 is the same (x1 = emb + attn_out, x2 = x1 + ffn(...)).
- Causal Mask
- Both the official and this PyTorch implementation use triu for the upper triangular mask to ensure autoregression.
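For reference, the triu-based causal mask can be checked in isolation. This standalone sketch uses all-zero scores so the unmasked softmax weights come out uniform:

```python
import torch

# triu with diagonal=1 marks strictly-future positions; filling them with
# -inf before softmax forces each position to attend only to itself and
# earlier tokens.
T = 4
scores = torch.zeros(T, T)
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)

# Row i attends uniformly over positions 0..i and gives 0 weight to the future.
assert attn[0, 1] == 0.0
assert torch.allclose(attn[1, :2], torch.tensor([0.5, 0.5]))
```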
- Output Layer
Official implementation (openai/gpt-2)
- After the final LayerNorm, the input embedding matrix wte (equivalent to self.embed.weight) is transposed and used for the output linear transformation (weight tying / weight sharing).
- This reduces the parameter count and improves training stability.
MiniGPT2 (This implementation)
- After the final FFN, logits are calculated with torch.matmul(x2, self.embed.weight.t()), sharing the embedding weights with the output weights (weight tying).
- Like the official implementation, weight tying has been correctly introduced, making the structure and behavior more "GPT-2-like." (Unlike the official model, however, MiniGPT2 omits the extra final LayerNorm before the output projection.)
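Weight tying is easy to demonstrate in isolation. This is a hedged sketch with made-up dimensions, not the model above:

```python
import torch
import torch.nn as nn

# The output projection reuses the embedding matrix transposed, so no
# separate (d_model x vocab) output head is ever created.
vocab, d_model = 30, 32
embed = nn.Embedding(vocab, d_model)

x = torch.randn(2, 5, d_model)      # hidden states from the last block
logits = x @ embed.weight.t()       # projects back onto the vocabulary

assert logits.shape == (2, 5, vocab)
# Parameter saving: only the embedding's vocab * d_model weights exist.
n_params = sum(p.numel() for p in embed.parameters())
assert n_params == vocab * d_model
```

With an untied output head, an extra nn.Linear(d_model, vocab) would roughly double the vocabulary-related parameters; tying removes that duplication entirely.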
- Disabling Bias in Q, K, V, attn_out, and FFN
Official implementation (from openai/gpt-2/src/model.py)
- In the official TensorFlow implementation, the Q/K/V, attention-output, and FFN (MLP) transformations are all implemented with conv1d(...), and bias terms are in fact defined there via b = tf.get_variable('b', [nf], initializer=tf.constant_initializer(0)).
- Omitting these biases (bias=False) is nevertheless common in recent minimal reimplementations, since they add little in combination with LayerNorm and reduce the parameter count.
- Specifically, the linear transformations in question are:
  - Q, K, V: the split of c_attn
  - Attention output: c_proj
  - FFN layers: c_fc, c_proj
Our MiniGPT2 implementation (PyTorch)
- bias=False is explicitly set in each linear layer for Q, K, V, attn_out, and FFN (MLP).
- This is a deliberate simplification relative to the official GPT-2 implementation, which does keep these bias terms.
Supplementary Notes
- Bias (and scale) parameters remain in LayerNorm, matching the official implementation; nn.Embedding has no bias term to begin with.
- The main reasons given for disabling the linear-layer biases are: they add little in combination with LayerNorm, they reduce the parameter count, and they can stabilize training behavior.
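The parameter-count effect of bias=False is easy to quantify; a small standalone check at the d_model=32 used in this post:

```python
import torch.nn as nn

# Each bias would add d_model parameters per linear layer; MiniGPT2 has six
# such layers (q, k, v, attn_out, and two FFN layers).
d_model = 32
with_bias = nn.Linear(d_model, d_model, bias=True)
without_bias = nn.Linear(d_model, d_model, bias=False)

n_with = sum(p.numel() for p in with_bias.parameters())
n_without = sum(p.numel() for p in without_bias.parameters())

assert n_without == d_model * d_model       # weight matrix only
assert n_with - n_without == d_model        # 32 params saved per layer
```

At this scale the saving (6 × 32 = 192 parameters) is negligible; the point is the design convention rather than the size.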
Future Outlook and Application Examples
- For other applications, GPT-4.1 suggested repurposing the model for a typo-correction task: give it a word with an intentional one-character error as a prompt and see whether the correct word is generated. In essence, this would be a form of anomaly detection in the spirit of an autoencoder. It could be evaluated by the exact-match rate between the correct answer and the model output, or by the Levenshtein distance (edit distance).
- While it might be a bit difficult given the simplicity of the model, another possibility is to replace the dataset with one consisting of character names to use it as an automatic character naming tool.
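The edit-distance metric mentioned above can be made concrete. This is a minimal Levenshtein distance implementation; the function name and the example words (one of them a generated sample from earlier) are illustrative:

```python
# Two-row dynamic-programming Levenshtein (edit) distance: the minimum
# number of insertions, deletions, and substitutions to turn a into b.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

assert levenshtein("cat", "cat") == 0
assert levenshtein("appine", "apple") == 2
```

A typo-correction run could then report the average distance between each model output and its target word, with 0 meaning an exact match.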
Summary
With this, I feel like I've finally gotten a bit of a feel for what GPT-2 is all about.
Looking back, this series of articles is what I would have wanted to read if there had been an "Extended Edition" of Deep Learning from Scratch 2. When I was reading "Deep Learning from Scratch 2," generative AI hadn't developed to this extent, and I felt that I couldn't properly get my hands on things like Transformers. However, in the current landscape, I've become able to receive guidance with the help of AI.
I wonder what I should do next.