
【ML】Gradient Accumulation explained

Published on 2024/08/22

1. What is Gradient Accumulation?

Gradient accumulation is a technique for training deep learning models when the available hardware (such as a GPU) does not have enough memory to process a large batch at once. Instead of updating the model weights after every small batch, you accumulate gradients over several small batches and update the weights only once the accumulated gradients correspond to the larger target batch size. This gives the effect of a large batch while using only the VRAM needed for a small one.
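
The identity behind this can be written out as a short sketch. The symbols are assumptions introduced here for illustration, not notation from the article: b is the small-batch size, K is the number of accumulation steps, B_1, ..., B_K are the small batches, B is their union (size bK), and ℓ is the per-sample loss. Assuming the batch loss is the mean of per-sample losses and all small batches have the same size:

```latex
\nabla_\theta L_B(\theta)
  = \nabla_\theta \frac{1}{bK} \sum_{k=1}^{K} \sum_{x \in B_k} \ell(x; \theta)
  = \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \left( \frac{1}{b} \sum_{x \in B_k} \ell(x; \theta) \right)
  = \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta L_{B_k}(\theta)
```

In other words, the large-batch gradient is the average of the K small-batch gradients. The identity does not hold exactly for layers whose output depends on the other samples in the batch, such as batch normalization.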

2. Procedure

2.1 Small-Batch Training

You choose a small batch size that fits into your hardware's memory and run the forward and backward passes as usual.

2.2 Gradient Accumulation

Instead of applying the gradients after every small batch, you accumulate them over multiple small batches: each backward pass adds (sums) its gradients to those already stored.
Once the number of accumulated samples reaches the target batch size, you update the model weights and reset the accumulated gradients to zero.

Then you repeat this cycle for the rest of the training data.
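
The procedure can be sketched without any framework. The following is a minimal toy example, not a real training setup: the one-parameter model `y_hat = w * x`, the data, the learning rate, and the helper `mse_grad` are all made up for illustration. It runs the accumulate, update, and reset cycle over 4 small batches of 2 samples and compares the result with a single update on the full batch of 8 samples.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical toy data (8 samples)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

lr = 0.01
micro_batch_size = 2
accumulation_steps = 4  # effective batch size = 2 * 4 = 8

# Reference: one gradient-descent step on the full batch of 8 samples
w_full = 0.5
w_full -= lr * mse_grad(w_full, xs, ys)

# Same step via gradient accumulation over 4 small batches
w_accum = 0.5
grad_buffer = 0.0
for step, start in enumerate(range(0, len(xs), micro_batch_size)):
    mb_x = xs[start:start + micro_batch_size]
    mb_y = ys[start:start + micro_batch_size]

    # Accumulate: add the scaled small-batch gradient to the buffer
    grad_buffer += mse_grad(w_accum, mb_x, mb_y) / accumulation_steps

    # Update and reset once enough small batches have been accumulated
    if (step + 1) % accumulation_steps == 0:
        w_accum -= lr * grad_buffer
        grad_buffer = 0.0

# The two updates agree up to floating-point rounding
print(abs(w_full - w_accum) < 1e-9)  # True
```

Dividing each small-batch gradient by `accumulation_steps` turns the sum into an average, which is why the accumulated update matches the full-batch one.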

3. Example Code

Here is an example in PyTorch. MyModel and MyDataset are placeholders for your own model and dataset.

import torch
from torch.optim import Adam
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader

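# Device and epoch count (example values; adjust to your setup)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 10

model = MyModel().to(device)  # placeholder: your own nn.Module
dataset = MyDataset()  # placeholder: your own Dataset
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)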

# optimizer and loss function
optimizer = Adam(model.parameters(), lr=1e-3)
criterion = CrossEntropyLoss()

# Accumulation steps: effective batch size = dataloader batch_size * accumulation_steps (32 * 4 = 128)
accumulation_steps = 4

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()  # Clear gradients at the start of each epoch
    
    for i, (inputs, targets) in enumerate(dataloader):
        inputs, targets = inputs.to(device), targets.to(device)
        
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        
        # Normalize loss to account for accumulation steps
        loss = loss / accumulation_steps
        
        # Backward pass
        loss.backward()
        
        # Perform optimizer step only after accumulating enough gradients
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()  # Clear gradients after updating
    
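    # Flush leftover gradients when the number of batches is not a multiple of accumulation_steps
    # (they were still divided by accumulation_steps, so this last gradient is slightly under-scaled)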
    if (i + 1) % accumulation_steps != 0:
        optimizer.step()
        optimizer.zero_grad()

print("Training complete!")
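
To see how the numbers in the loop above work out, here is a small arithmetic sketch. The helper `accumulation_plan` and the dataset size of 1100 samples are hypothetical and only serve as an illustration. It assumes the `DataLoader` default `drop_last=False`, so the last batch may be smaller than `batch_size`.

```python
import math


def accumulation_plan(num_samples, batch_size, accumulation_steps):
    """Count batches and optimizer updates per epoch for a loop like the one above."""
    # With drop_last=False the final, possibly smaller, batch is kept
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # One update per full group, plus one flush if batches are left over
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    leftover_batches = batches_per_epoch % accumulation_steps
    return {
        "effective_batch_size": batch_size * accumulation_steps,
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "leftover_batches": leftover_batches,
    }


# Hypothetical dataset of 1100 samples with the settings used above
plan = accumulation_plan(num_samples=1100, batch_size=32, accumulation_steps=4)
print(plan)
```

With these values there are 35 batches per epoch and 9 weight updates: 8 full groups of 4 batches, plus one final flush for the 3 leftover batches.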

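4. Advantages and Disadvantages

Advantages
・Can train with a large effective batch size on a GPU with little memory
・More stable training, because each update uses gradients averaged over more samples and is therefore less noisy
Disadvantage
・Slower training, because each batch is small and one weight update needs several forward and backward passes

Training is slower, but the memory savings are often worth it.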

5. Summary

Gradient accumulation is a technique for training with a large effective batch size when memory is limited.
Please give it a try.
