
On batch size settings in Llama 3

nariaki3551

Model FLOPs Utilization (MFU): a metric for evaluating the training efficiency of large language models (LLMs)

https://www.adamcasson.com/posts/transformer-flops

The best practice now for reporting LLM training efficiency is known as model FLOPs utilization (MFU) which was proposed in Google's PaLM paper. The idea is to focus on how efficiently we executed just the necessary model FLOPs. The calculation is quite simple as all we need to do is multiply our FLOPs count by the observed throughput (tokens/sequences per second), and then divide by the theoretical peak FLOPS of our hardware.

MFU = CD / P

, where C is the model's FLOPs per token, D is the observed tokens per second, and P is the theoretical peak FLOPS.

C ... FLOPs required to process one token
D ... tokens processed per second (observed)
P ... theoretical peak FLOPS of the hardware

For P, if the hardware is a GPU, for example, use the peak arithmetic performance listed in its documentation.
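
To make the formula concrete, here is a minimal sketch of the MFU calculation. The ≈6N FLOPs-per-token approximation, the ≈989 TFLOPS BF16 peak (H100 SXM), and the throughput figure are my own illustrative assumptions, not numbers from the sources above:

```python
# Minimal MFU sketch: MFU = C * D / P.
# Assumptions (mine, for illustration): C ≈ 6 * num_params FLOPs per token
# (forward + backward), P ≈ 989 TFLOPS (H100 SXM BF16 dense peak), and a
# hypothetical observed throughput D.

def mfu(flops_per_token: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Fraction of the hardware's peak FLOPS spent on the required model FLOPs."""
    return flops_per_token * tokens_per_sec / peak_flops

num_params = 405e9                  # e.g. Llama 3 405B
C = 6 * num_params                  # FLOPs per token (approximation)
D = 175.0                           # tokens/sec per GPU (hypothetical observation)
P = 989e12                          # theoretical peak FLOPS per GPU (assumed)

print(f"MFU = {mfu(C, D, P):.1%}")  # -> about 43%
```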

nariaki3551

Hardware FLOPS Utilization (HFU):

https://github.com/baochi0212/ml-engineering-tips/blob/master/training/performance/README.md#mfu-vs-hfu

MFU = Estimated_Achieved_FLOPS / Theoretical_FLOPS
HFU = Actual_Achieved_FLOPS / Theoretical_FLOPS

MFU ignores gradient checkpointing / activation recomputation (although one could presumably force it into the calculation), while HFU includes those extra FLOPs, so it is the more exact figure.

For example Megatron-LM published the following stats for A100-80GB:

| Model Size | Model FLOPs Utilization | Hardware FLOPs Utilization |
|---|---|---|
| 22B | 41.5% | 43.7% |
| 175B | 51.4% | 52.8% |
| 530B | 56.0% | 57.0% |
| 1T | 56.3% | 57.0% |

Since Megatron-LM performs activation recomputation, MFU < HFU. The more extra computation is performed, the further HFU exceeds MFU.

https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/benchmarking#mfu-and-hfu

The link above reports MFU and HFU for many more environments.
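
To make the MFU/HFU difference concrete, here is a rough sketch under two common approximations that are my own assumptions, not taken from the links above: required training FLOPs ≈ 6N per token (2N forward + 4N backward), and full activation recomputation re-running the forward pass for another ≈ 2N. The Megatron-LM numbers above show a much smaller gap, presumably because only part of the activations is recomputed there.

```python
# Hedged sketch of MFU vs HFU under full activation recomputation.
# Assumptions (mine): required train FLOPs per token ≈ 6N (2N fwd + 4N bwd);
# full recomputation re-runs the forward pass, adding ≈ 2N, so the hardware
# actually executes ≈ 8N FLOPs per token.

def mfu_hfu(num_params: float, tokens_per_sec: float, peak_flops: float,
            full_recompute: bool = True) -> tuple[float, float]:
    model_flops = 6 * num_params                      # FLOPs counted by MFU
    extra = 2 * num_params if full_recompute else 0   # extra forward pass from recomputation
    hw_flops = model_flops + extra                    # FLOPs counted by HFU
    return (model_flops * tokens_per_sec / peak_flops,
            hw_flops * tokens_per_sec / peak_flops)

# Hypothetical 22B model on A100-80GB (BF16 dense peak ≈ 312 TFLOPS):
mfu, hfu = mfu_hfu(num_params=22e9, tokens_per_sec=1000.0, peak_flops=312e12)
print(f"MFU = {mfu:.1%}, HFU = {hfu:.1%}")  # HFU > MFU whenever recomputation adds work
```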

nariaki3551

3.3.2 Parallelism for Model Scaling

  • The global batch sizes are 2,048, 2,048, and 128 sequences (Batch size/DP × DP).
  • Comparing the bottom two rows, the configuration with the larger DP degree achieves the higher MFU.

GPU utilization. Through careful tuning of the parallelism configuration, hardware, and software, we achieve an overall BF16 Model FLOPs Utilization (MFU; chowdhery2023palm) of 38-43% for the configurations shown in Table 4. The slight drop in MFU to 41% on 16K GPUs with DP=128 compared to 43% on 8K GPUs with DP=64 is due to the lower batch size per DP group needed to keep the global tokens per batch constant during training.

| GPUs | TP | CP | PP | DP | Seq. Len. | Batch size/DP | Tokens/Batch | TFLOPs/GPU | BF16 MFU |
|---|---|---|---|---|---|---|---|---|---|
| 8,192 | 8 | 1 | 16 | 64 | 8,192 | 32 | 16M | 430 | 43% |
| 16,384 | 8 | 1 | 16 | 128 | 8,192 | 16 | 16M | 400 | 41% |
| 16,384 | 8 | 16 | 16 | 8 | 131,072 | 16 | 16M | 380 | 38% |

Table 4: Scaling configurations and MFU for each stage of Llama 3 405B pre-training. See text and Figure 5 for descriptions of each type of parallelism.
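
As a quick arithmetic check on Table 4 (a sketch; the per-GPU BF16 dense peak of ≈989 TFLOPS for an H100 is my assumption, since the paper does not state the denominator it used, so the recomputed MFU only roughly matches the reported 43/41/38%):

```python
# Consistency check of Table 4 (assumption: per-GPU BF16 dense peak ≈ 989 TFLOPS).
H100_BF16_PEAK_TFLOPS = 989.0

rows = [
    # (GPUs, TP, CP, PP, DP, seq_len, batch_size_per_DP, TFLOPs_per_GPU)
    (8_192,  8,  1, 16,  64,   8_192, 32, 430),
    (16_384, 8,  1, 16, 128,   8_192, 16, 400),
    (16_384, 8, 16, 16,   8, 131_072, 16, 380),
]

for gpus, tp, cp, pp, dp, seq_len, bsz_per_dp, tflops in rows:
    assert tp * cp * pp * dp == gpus          # parallel degrees must multiply to the GPU count
    batch_seqs = bsz_per_dp * dp              # global batch size in sequences
    tokens_per_batch = batch_seqs * seq_len   # = 16M (2^24) tokens in every row
    mfu = tflops / H100_BF16_PEAK_TFLOPS
    print(f"{gpus:>6} GPUs: {batch_seqs:>4} seqs/batch, "
          f"{tokens_per_batch / 2**20:.0f}M tokens/batch, MFU ≈ {mfu:.0%}")
```

The identity TP × CP × PP × DP = GPUs is also what pins DP to 8 (not 4) in the last row.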

nariaki3551

3.3.2 Parallelism for Model Scaling

Batch size constraint. Current implementations have constraints on supported batch size per GPU, requiring it to be divisible by the number of pipeline stages. For the example in Figure 6, the depth-first schedule (DFS) of pipeline parallelism (narayanan2021efficient) requires N = PP = 4, while the breadth-first schedule (BFS; lamy2023breadth) requires N = M, where M is the total number of micro-batches and N is the number of contiguous micro-batches for the same stage's forward or backward. However, pre-training often needs flexibility to adjust batch size.

The batch size must be divisible by the number of pipeline-parallel stages. (Is this an implementation constraint, or a matter of execution efficiency...?)
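
Here is how I read that constraint, as a small sketch (hypothetical helper, not Llama 3 or Megatron code): the number of micro-batches M per DP rank must be divisible by N, where N = PP for the depth-first schedule and N = M for the breadth-first one, so DFS effectively pins the per-GPU batch to multiples of the stage count.

```python
# Sketch of the batch-size constraint as I read the quoted passage (not Llama 3 code):
# M (micro-batches per DP rank) must be divisible by N, where N = PP for the
# depth-first schedule (DFS) and N = M for the breadth-first schedule (BFS).

def valid_microbatch_counts(pp: int, schedule: str, max_m: int = 32) -> list[int]:
    """M values a given pipeline schedule accepts, under this reading of the constraint."""
    if schedule == "DFS":
        return [m for m in range(1, max_m + 1) if m % pp == 0]  # multiples of the stage count
    if schedule == "BFS":
        return list(range(1, max_m + 1))                        # N = M holds for any M
    raise ValueError(f"unknown schedule: {schedule}")

print(valid_microbatch_counts(pp=4, schedule="DFS"))    # [4, 8, 12, 16, 20, 24, 28, 32]
print(valid_microbatch_counts(pp=4, schedule="BFS")[:6])
```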

nariaki3551

3.2.1 Scaling Laws

We use a fixed batch size for each compute scale, ranging between 250K and 4M.

The scaling-law experiments use fixed batch sizes between 250K and 4M (isn't that large?).

3.4.1 Initial Pre-Training

That section says the following, so I think the 250K–4M figures above are counted in tokens, not in sequences.

We pre-train Llama 3 405B using AdamW with a peak learning rate of 8 × 10⁻⁵, a linear warm up of 8,000 steps, and a cosine learning rate schedule decaying to 8 × 10⁻⁷ over 1,200,000 steps. We use a lower batch size early in training to improve training stability, and increase it subsequently to improve efficiency. Specifically, we use an initial batch size of 4M tokens and sequences of length 4,096, and double these values to a batch size of 8M sequences of 8,192 tokens after pre-training 252M tokens. We double the batch size again to 16M after pre-training on 2.87T tokens. We found this training recipe to be very stable: we observed few loss spikes and did not require interventions to correct for model training divergence.
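
The batch-size ramp described in the quote, written out as a small lookup (hypothetical helper name; the thresholds and values come straight from the quoted text):

```python
# The batch-size warm-up described in §3.4.1, written as a small lookup.
# (Hypothetical helper; the thresholds and values come straight from the quote.)

def llama3_batch_schedule(tokens_seen: float) -> tuple[str, int]:
    """Return (global batch size in tokens, sequence length) for a given training position."""
    if tokens_seen < 252e6:        # early training: smaller batch for stability
        return "4M", 4_096
    elif tokens_seen < 2.87e12:    # after 252M tokens: double both values
        return "8M", 8_192
    else:                          # after 2.87T tokens: double the batch size again
        return "16M", 8_192

for t in (1e6, 1e9, 3e12):
    print(t, llama3_batch_schedule(t))
```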

nariaki3551

7 Vision Experiments

7.4 Pre-training

Image. We initialize from the pre-trained text model and vision encoder weights. The vision encoder is unfrozen, while the text model weights are kept frozen as explained above. First, we train the model using 6B image-text pairs where each image is resized to fit within four tiles of 336 × 336 pixels. We use a global batch size of 16,384 and a cosine learning rate schedule with initial learning rate 10 × 10⁻⁴ and a weight decay of 0.01. The initial learning rate was determined based on small-scale experiments. However, these findings did not generalize well to very long training schedules and dropped the learning rate a few times during training when the loss values became stagnant. After the base pre-training, we increase the image resolution further and continue training the same weights on the annealing dataset. The optimizer is re-initialized via warm-up to learning rate 2 × 10⁻⁵ and again follows a cosine schedule.

Video. For video pre-training, we start from the image pre-trained and annealed weights as described above. We add the video aggregator and cross-attention layers as described in the architecture, initialized randomly. We freeze all the parameters in the model except the video-specific ones (the aggregator and video cross-attention), and train them on the video pre-training data. We use the same training hyperparameters as the image annealing stage, with small differences in the learning rate. We uniformly sample 16 frames from the full video, and represent each frame using four chunks, each of size 448 × 448 pixels. We use an aggregation factor of 16 in the video aggregator, hence obtaining one effective frame, which the text tokens cross-attend to. We use a global batch size of 4,096, a sequence length of 190 tokens, and a learning rate of 10⁻⁴ during training.
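
For reference, the quoted image/video pre-training settings collected into one place (illustrative dicts only; the key names are mine, the values come from the quote):

```python
# Vision pre-training hyperparameters as quoted above, gathered into dicts.
# (Illustrative only; key names are my own, values are from the quoted text.)

image_pretrain = {
    "init": "pre-trained text model + vision encoder (text weights frozen)",
    "data": "6B image-text pairs",
    "tiles": (4, 336, 336),        # up to four 336x336 tiles per image
    "global_batch_size": 16_384,
    "lr_schedule": "cosine",
    "initial_lr": 10e-4,           # as written in the quote (10 × 10⁻⁴)
    "weight_decay": 0.01,
    "annealing_warmup_lr": 2e-5,   # optimizer re-initialized for higher-resolution annealing
}

video_pretrain = {
    "init": "image pre-trained + annealed weights; only aggregator and video cross-attention trained",
    "frames_sampled": 16,
    "chunks_per_frame": (4, 448, 448),  # four 448x448 chunks per frame
    "aggregation_factor": 16,           # 16 frames -> one effective frame
    "global_batch_size": 4_096,
    "sequence_length": 190,
    "lr": 1e-4,
}

print(image_pretrain["global_batch_size"], video_pretrain["global_batch_size"])
```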

This scrap was closed on 2024/11/18.