【ML】Gradient Accumulation explained
1. What is Gradient Accumulation?
Gradient accumulation is a technique for training deep learning models when the available hardware (such as a GPU) does not have enough memory to process a large batch at once. Instead of updating the model weights after every small batch, you accumulate gradients over several small batches and update the weights only once the accumulated gradients correspond to the larger batch size you want. This gives the effect of a large batch while using only the VRAM needed for a small one.
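To make the arithmetic concrete, here is a small sketch of how the effective batch size and the number of weight updates per epoch follow from the small batch size and the number of accumulation steps. The function names and the 10,000-sample dataset are illustrative assumptions, and the step count assumes the leftover batches at the end of an epoch still trigger one final update.

```python
import math

def effective_batch_size(micro_batch_size, accumulation_steps):
    """Number of samples that contribute to a single weight update."""
    return micro_batch_size * accumulation_steps

def optimizer_steps_per_epoch(num_samples, micro_batch_size, accumulation_steps):
    """Weight updates per epoch, counting one extra update for leftover batches."""
    num_batches = math.ceil(num_samples / micro_batch_size)
    return math.ceil(num_batches / accumulation_steps)

# Small batches of 32 accumulated over 4 steps behave like a batch of 128.
print(effective_batch_size(32, 4))
# A hypothetical dataset of 10,000 samples: 313 small batches, 79 weight updates.
print(optimizer_steps_per_epoch(10_000, 32, 4))
```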
2. Procedure
2.1 Small-Batch Training
You choose a small batch size that fits into your hardware’s memory and run the forward and backward passes as usual.
2.2 Gradient Accumulation
Instead of applying the gradients immediately, you accumulate them over multiple small batches: the gradients are summed over several iterations.
Once the number of accumulated samples reaches the target batch size, you update the model weights and reset the accumulated gradients to zero.
Then you repeat the process for the next group of small batches.
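The procedure can be checked on a toy problem without any deep learning library. The sketch below assumes a one-parameter model y_hat = w * x with a mean squared error loss and a hand-written gradient. All names and data values are made up for illustration. It shows that summing small-batch gradients, each divided by the number of accumulation steps, reproduces the gradient of the full batch when the small batches are equal in size.

```python
# Toy check: accumulated small-batch gradients match the full-batch gradient.
# Model: y_hat = w * x, loss = mean((w * x - y) ** 2). Names are illustrative.

def mse_grad(w, xs, ys):
    """Gradient of the mean squared error with respect to w for one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum small-batch gradients, each scaled by 1 / accumulation_steps."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal-sized small batches"
    accumulation_steps = len(xs) // micro_batch_size
    grad = 0.0  # plays the role of the gradient buffer that backward() adds to
    for start in range(0, len(xs), micro_batch_size):
        xb = xs[start:start + micro_batch_size]
        yb = ys[start:start + micro_batch_size]
        grad += mse_grad(w, xb, yb) / accumulation_steps
    return grad

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_grad(w, xs, ys)                               # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # 4 batches of 2
print(abs(full - accum) < 1e-9)
```

The division by the number of accumulation steps is what turns a sum of per-batch averages into an average over the whole large batch.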
3. Example Code
Here is an example in PyTorch.
import torch
from torch.optim import Adam
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader

# MyModel and MyDataset are placeholders for your own model and dataset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)
dataset = MyDataset()
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# optimizer and loss function
optimizer = Adam(model.parameters(), lr=1e-3)
criterion = CrossEntropyLoss()

# accumulation steps; the effective batch size is the dataloader's batch_size * accumulation_steps
accumulation_steps = 4
num_epochs = 10  # example value

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()  # Clear gradients at the start of each epoch
    for i, (inputs, targets) in enumerate(dataloader):
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Normalize loss to account for accumulation steps
        loss = loss / accumulation_steps

        # Backward pass: gradients are added to the existing .grad buffers
        loss.backward()

        # Perform optimizer step only after accumulating enough gradients
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()  # Clear gradients after updating

    # Handle leftover batches when the number of batches is not a multiple of accumulation_steps
    if (i + 1) % accumulation_steps != 0:
        optimizer.step()
        optimizer.zero_grad()

print("Training complete!")
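One detail of the loop above: every loss is divided by accumulation_steps, including the leftover batches at the end of an epoch, so the final update of an epoch is scaled down when fewer batches contributed to it. If that matters for your training, one possible adjustment is to divide by the number of batches actually in each group. The helper below is a hypothetical sketch of that bookkeeping, not part of PyTorch. It assumes the number of batches per epoch is known in advance, for example from len(dataloader).

```python
def accumulation_plan(num_batches, accumulation_steps):
    """Return (divisor, step_now) for each batch index in one epoch.

    divisor  : value to divide the loss by for that batch
    step_now : whether to call optimizer.step() after that batch
    """
    plan = []
    remainder = num_batches % accumulation_steps
    first_tail_index = num_batches - remainder  # batches from here on form the short last group
    for i in range(num_batches):
        in_tail = i >= first_tail_index
        divisor = remainder if in_tail else accumulation_steps
        step_now = (i + 1) % accumulation_steps == 0 or i == num_batches - 1
        plan.append((divisor, step_now))
    return plan

# 10 batches with 4 accumulation steps: two full groups of 4, then a group of 2.
for index, (divisor, step_now) in enumerate(accumulation_plan(10, 4)):
    print(index, divisor, step_now)
```

In the training loop, the divisor would replace accumulation_steps in the loss normalization and step_now would replace the two modulo checks. This still treats the last small batch as full-sized, so it is an approximation when the dataset size is not a multiple of the batch size.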
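4. Advantages and Disadvantages
Advantages
・Can train with less GPU memory
・More stable training, because gradients averaged over a larger effective batch are less noisy
Disadvantage
・Slower training, because each weight update takes several sequential forward and backward passes on small batches
Training is slower, but the memory savings are often worth it.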
5. Summary
Gradient accumulation is a technique for training with a large effective batch size on limited memory.
Please give it a try.