【ML】How much memory is required for inference compared to training?
This time, I would like to look at how much less VRAM inference requires compared to training.
1. Why does inference require less VRAM than training?
Inference requires less VRAM than training because it does not need to store gradients or optimizer states.
While the exact reduction depends on the model architecture and the specific task, a general rule of thumb is that, at the same batch size, inference can use around 50% to 70% less memory than training, precisely because those gradients and optimizer states are not needed.
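To make this concrete, here is a minimal sketch, assuming PyTorch and a CUDA GPU, that compares the peak memory of an inference pass with that of a single training step on the same toy model (the layer sizes, batch size, and choice of Adam are arbitrary assumptions for illustration, not figures from this article).

```python
import torch
import torch.nn as nn

device = "cuda"  # the peak-memory statistics below are CUDA-only
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Linear(4096, 4096)).to(device)
x = torch.randn(256, 4096, device=device)

# --- inference: weights + activations only, no autograd graph is kept ---
torch.cuda.reset_peak_memory_stats()
model.eval()
with torch.no_grad():
    _ = model(x)
inference_peak = torch.cuda.max_memory_allocated() / 2**20

# --- one training step: weights + activations + gradients + optimizer states ---
optimizer = torch.optim.Adam(model.parameters())  # Adam keeps two extra tensors per parameter
torch.cuda.reset_peak_memory_stats()
model.train()
loss = model(x).square().mean()  # dummy loss, just to trigger backpropagation
loss.backward()
optimizer.step()
training_peak = torch.cuda.max_memory_allocated() / 2**20

print(f"inference peak: {inference_peak:.0f} MiB")
print(f"training peak:  {training_peak:.0f} MiB")
```

The exact ratio you see will depend on the model and batch size, but the training peak should be noticeably higher because of the gradients, optimizer states, and activations kept for the backward pass.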
2. Batch size
Memory usage scales roughly linearly with the batch size: if you halve the batch size, the batch-dependent memory roughly halves, while the other components (such as the model parameters) stay constant.
This is because the intermediate activations needed for backpropagation are stored for every sample in the batch during training, so simply reducing the batch size directly reduces memory usage.
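As a rough illustration (again assuming PyTorch and a CUDA GPU, with the same arbitrary toy network as above), the following sketch prints the peak memory of one forward/backward pass at a few batch sizes; the batch-dependent part should grow roughly linearly while the parameter memory stays fixed.

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Linear(4096, 4096)).to(device)

for batch_size in (32, 64, 128, 256):
    model.zero_grad(set_to_none=True)    # drop gradients from the previous run
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, 4096, device=device)
    loss = model(x).square().mean()
    loss.backward()                      # consumes the stored per-sample activations
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"batch size {batch_size:4d}: peak {peak:.0f} MiB")
```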
※Technically, it is possible to reduce memory usage without reducing the effective batch size by computing the gradient for each sample (or small chunk) of the batch separately, keeping only the accumulated gradient, and averaging once the whole batch has been processed. The drawback is that it significantly increases training time, since the forward/backward passes become sequential (in the extreme, the same wall-clock cost as batch size = 1), so it is mainly worthwhile when memory is the real constraint.
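The trick described in this note is commonly known as gradient accumulation. Below is a minimal sketch of the idea in PyTorch (the model, sizes, and the SGD optimizer are arbitrary assumptions for illustration): an effective batch of 256 is split into micro-batches of 32, so only one micro-batch's activations are alive at a time, at the cost of running the passes sequentially.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Linear(4096, 4096))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

effective_batch, micro_batch = 256, 32
steps = effective_batch // micro_batch
data = torch.randn(effective_batch, 4096)

optimizer.zero_grad(set_to_none=True)
for i in range(steps):
    chunk = data[i * micro_batch:(i + 1) * micro_batch]
    loss = model(chunk).square().mean()
    # dividing by the number of micro-steps makes the accumulated gradient
    # match (up to floating-point error) the gradient of the full 256-sample batch
    (loss / steps).backward()
optimizer.step()  # a single parameter update for the whole effective batch
```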
3. Summary
In conclusion, at the same batch size, inference needs only about 30-50% of the memory used for training, and reducing the batch size further decreases memory usage in roughly linear proportion.
Reducing the batch size during inference increases inference time, but the model's outputs do not change, so if you cut the batch size to 1/5 or 1/10 you can run the model with roughly 5-10% of the memory needed for training.
(For example, with 16GB available for inference, you could roughly run a model whose training required on the order of 160-320GB.)
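For quick estimates, this rule of thumb can be packed into a tiny helper. This is just the guideline above written as code; the default ratios are rough figures, not measured constants.

```python
def estimated_inference_memory_gb(training_memory_gb: float,
                                  inference_ratio: float = 0.5,
                                  batch_size_factor: float = 0.1) -> float:
    """Rough estimate: training memory x (inference share at the same batch
    size, ~0.3-0.5) x (batch-size reduction factor, e.g. 1/5 or 1/10)."""
    return training_memory_gb * inference_ratio * batch_size_factor

# e.g. roughly 16 GB for inference in both of these cases:
print(estimated_inference_memory_gb(160.0, batch_size_factor=1 / 5))   # 16.0
print(estimated_inference_memory_gb(320.0, batch_size_factor=1 / 10))  # 16.0
```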
These are not exact values, but I think they are useful to remember as a rough guideline.
That's all for now. Thank you for reading.