
On batch size settings in Llama 3

nariaki3551

Model FLOPs Utilization (MFU): a metric for evaluating the training efficiency of large language models (LLMs)

https://www.adamcasson.com/posts/transformer-flops

The best practice now for reporting LLM training efficiency is known as model FLOPs utilization (MFU) which was proposed in Google's PaLM paper. The idea is to focus on how efficiently we executed just the necessary model FLOPs. The calculation is quite simple as all we need to do is multiply our FLOPs count by the observed throughput (tokens/sequences per second), and then divide by the theoretical peak FLOPS of our hardware.

MFU = CD / P

, where C is the model's FLOPs per token, D is the observed tokens per second, and P is the theoretical peak FLOPS.

C ... FLOPs required to process one token
D ... tokens processed per second (observed)
P ... theoretical peak FLOPS of the hardware

For P, if the hardware is a GPU, for example, use the peak arithmetic performance listed in its documentation.
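
To make the formula concrete, here is a minimal sketch of the MFU calculation. The ≈6N FLOPs-per-token approximation, the ≈989 TFLOPS BF16 peak (H100 SXM), and the throughput figure are my own illustrative assumptions, not numbers from the sources above:

```python
# Minimal MFU sketch: MFU = C * D / P.
# Assumptions (mine, for illustration): C ≈ 6 * num_params FLOPs per token
# (forward + backward), P ≈ 989 TFLOPS (H100 SXM BF16 dense peak), and a
# hypothetical observed throughput D.

def mfu(flops_per_token: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Fraction of the hardware's peak FLOPS spent on the required model FLOPs."""
    return flops_per_token * tokens_per_sec / peak_flops

num_params = 405e9                  # e.g. Llama 3 405B
C = 6 * num_params                  # FLOPs per token (approximation)
D = 175.0                           # tokens/sec per GPU (hypothetical observation)
P = 989e12                          # theoretical peak FLOPS per GPU (assumed)

print(f"MFU = {mfu(C, D, P):.1%}")  # -> about 43%
```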

nariaki3551

Hardware FLOPS Utilization (HFU):

https://github.com/baochi0212/ml-engineering-tips/blob/master/training/performance/README.md#mfu-vs-hfu

MFU = Estimated_Achieved_FLOPS / Theoretical_FLOPS
HFU = Actual_Achieved_FLOPS / Theoretical_FLOPS

MFU ignores gradient checkpointing / activation recomputation (although one could presumably force it into the calculation), while HFU includes those extra FLOPs, so it is the more exact figure.

For example Megatron-LM published the following stats for A100-80GB:

| Model Size | Model FLOPs Utilization | Hardware FLOPs Utilization |
|---|---|---|
| 22B | 41.5% | 43.7% |
| 175B | 51.4% | 52.8% |
| 530B | 56.0% | 57.0% |
| 1T | 56.3% | 57.0% |

Since Megatron-LM performs activation recomputation, MFU < HFU. The more extra computation is performed, the further HFU exceeds MFU.

https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/benchmarking#mfu-and-hfu

The link above reports MFU and HFU for many more environments.
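
To make the MFU/HFU difference concrete, here is a rough sketch under two common approximations that are my own assumptions, not taken from the links above: required training FLOPs ≈ 6N per token (2N forward + 4N backward), and full activation recomputation re-running the forward pass for another ≈ 2N. The Megatron-LM numbers above show a much smaller gap, presumably because only part of the activations is recomputed there.

```python
# Hedged sketch of MFU vs HFU under full activation recomputation.
# Assumptions (mine): required train FLOPs per token ≈ 6N (2N fwd + 4N bwd);
# full recomputation re-runs the forward pass, adding ≈ 2N, so the hardware
# actually executes ≈ 8N FLOPs per token.

def mfu_hfu(num_params: float, tokens_per_sec: float, peak_flops: float,
            full_recompute: bool = True) -> tuple[float, float]:
    model_flops = 6 * num_params                      # FLOPs counted by MFU
    extra = 2 * num_params if full_recompute else 0   # extra forward pass from recomputation
    hw_flops = model_flops + extra                    # FLOPs counted by HFU
    return (model_flops * tokens_per_sec / peak_flops,
            hw_flops * tokens_per_sec / peak_flops)

# Hypothetical 22B model on A100-80GB (BF16 dense peak ≈ 312 TFLOPS):
mfu, hfu = mfu_hfu(num_params=22e9, tokens_per_sec=1000.0, peak_flops=312e12)
print(f"MFU = {mfu:.1%}, HFU = {hfu:.1%}")  # HFU > MFU whenever recomputation adds work
```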

nariaki3551

3.3.2 Parallelism for Model Scaling

  • The global batch sizes are 2,048, 2,048, and 128 sequences (Batch size/DP × DP).
  • Comparing the bottom two rows, the configuration with the larger DP degree achieves the higher MFU.

GPU utilization. Through careful tuning of the parallelism configuration, hardware, and software, we achieve an overall BF16 Model FLOPs Utilization (MFU; chowdhery2023palm) of 38-43% for the configurations shown in Table 4. The slight drop in MFU to 41% on 16K GPUs with DP=128 compared to 43% on 8K GPUs with DP=64 is due to the lower batch size per DP group needed to keep the global tokens per batch constant during training.

| GPUs | TP | CP | PP | DP | Seq. Len. | Batch size/DP | Tokens/Batch | TFLOPs/GPU | BF16 MFU |
|---|---|---|---|---|---|---|---|---|---|
| 8,192 | 8 | 1 | 16 | 64 | 8,192 | 32 | 16M | 430 | 43% |
| 16,384 | 8 | 1 | 16 | 128 | 8,192 | 16 | 16M | 400 | 41% |
| 16,384 | 8 | 16 | 16 | 8 | 131,072 | 16 | 16M | 380 | 38% |

Table 4: Scaling configurations and MFU for each stage of Llama 3 405B pre-training. See text and Figure 5 for descriptions of each type of parallelism.
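
As a quick arithmetic check on Table 4 (a sketch; the per-GPU BF16 dense peak of ≈989 TFLOPS for an H100 is my assumption, since the paper does not state the denominator it used, so the recomputed MFU only roughly matches the reported 43/41/38%):

```python
# Consistency check of Table 4 (assumption: per-GPU BF16 dense peak ≈ 989 TFLOPS).
H100_BF16_PEAK_TFLOPS = 989.0

rows = [
    # (GPUs, TP, CP, PP, DP, seq_len, batch_size_per_DP, TFLOPs_per_GPU)
    (8_192,  8,  1, 16,  64,   8_192, 32, 430),
    (16_384, 8,  1, 16, 128,   8_192, 16, 400),
    (16_384, 8, 16, 16,   8, 131_072, 16, 380),
]

for gpus, tp, cp, pp, dp, seq_len, bsz_per_dp, tflops in rows:
    assert tp * cp * pp * dp == gpus          # parallel degrees must multiply to the GPU count
    batch_seqs = bsz_per_dp * dp              # global batch size in sequences
    tokens_per_batch = batch_seqs * seq_len   # = 16M (2^24) tokens in every row
    mfu = tflops / H100_BF16_PEAK_TFLOPS
    print(f"{gpus:>6} GPUs: {batch_seqs:>4} seqs/batch, "
          f"{tokens_per_batch / 2**20:.0f}M tokens/batch, MFU ≈ {mfu:.0%}")
```

The identity TP × CP × PP × DP = GPUs is also what pins DP to 8 (not 4) in the last row.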

nariaki3551

3.3.2 Parallelism for Model Scaling

Batch size constraint. Current implementations have constraints on supported batch size per GPU, requiring it to be divisible by the number of pipeline stages. For the example in Figure 6, the depth-first schedule (DFS) of pipeline parallelism (narayanan2021efficient) requires N = PP = 4, while the breadth-first schedule (BFS; lamy2023breadth) requires N = M, where M is the total number of micro-batches and N is the number of contiguous micro-batches for the same stage's forward or backward. However, pre-training often needs flexibility to adjust batch size.

The batch size must be divisible by the number of pipeline-parallel stages. (Is this an implementation constraint, or a matter of execution efficiency...?)
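
Here is how I read that constraint, as a small sketch (hypothetical helper, not Llama 3 or Megatron code): the number of micro-batches M per DP rank must be divisible by N, where N = PP for the depth-first schedule and N = M for the breadth-first one, so DFS effectively pins the per-GPU batch to multiples of the stage count.

```python
# Sketch of the batch-size constraint as I read the quoted passage (not Llama 3 code):
# M (micro-batches per DP rank) must be divisible by N, where N = PP for the
# depth-first schedule (DFS) and N = M for the breadth-first schedule (BFS).

def valid_microbatch_counts(pp: int, schedule: str, max_m: int = 32) -> list[int]:
    """M values a given pipeline schedule accepts, under this reading of the constraint."""
    if schedule == "DFS":
        return [m for m in range(1, max_m + 1) if m % pp == 0]  # multiples of the stage count
    if schedule == "BFS":
        return list(range(1, max_m + 1))                        # N = M holds for any M
    raise ValueError(f"unknown schedule: {schedule}")

print(valid_microbatch_counts(pp=4, schedule="DFS"))    # [4, 8, 12, 16, 20, 24, 28, 32]
print(valid_microbatch_counts(pp=4, schedule="BFS")[:6])
```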

nariaki3551

3.2.1 Scaling Laws

We use a fixed batch size for each compute scale, ranging between 250K and 4M.

The scaling-law experiments use fixed batch sizes between 250K and 4M (isn't that large?).

3.4.1 Initial Pre-Training

That section says the following, so I think the 250K–4M figures above are counted in tokens, not in sequences.

We pre-train Llama 3 405B using AdamW with a peak learning rate of 8 × 10⁻⁵, a linear warm up of 8,000 steps, and a cosine learning rate schedule decaying to 8 × 10⁻⁷ over 1,200,000 steps. We use a lower batch size early in training to improve training stability, and increase it subsequently to improve efficiency. Specifically, we use an initial batch size of 4M tokens and sequences of length 4,096, and double these values to a batch size of 8M sequences of 8,192 tokens after pre-training 252M tokens. We double the batch size again to 16M after pre-training on 2.87T tokens. We found this training recipe to be very stable: we observed few loss spikes and did not require interventions to correct for model training divergence.
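
The batch-size ramp described in the quote, written out as a small lookup (hypothetical helper name; the thresholds and values come straight from the quoted text):

```python
# The batch-size warm-up described in §3.4.1, written as a small lookup.
# (Hypothetical helper; the thresholds and values come straight from the quote.)

def llama3_batch_schedule(tokens_seen: float) -> tuple[str, int]:
    """Return (global batch size in tokens, sequence length) for a given training position."""
    if tokens_seen < 252e6:        # early training: smaller batch for stability
        return "4M", 4_096
    elif tokens_seen < 2.87e12:    # after 252M tokens: double both values
        return "8M", 8_192
    else:                          # after 2.87T tokens: double the batch size again
        return "16M", 8_192

for t in (1e6, 1e9, 3e12):
    print(t, llama3_batch_schedule(t))
```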

nariaki3551

7 Vision Experiments

7.4 Pre-training

Image. We initialize from the pre-trained text model and vision encoder weights. The vision encoder is unfrozen, while the text model weights are kept frozen as explained above. First, we train the model using 6B image-text pairs where each image is resized to fit within four tiles of 336 × 336 pixels. We use a global batch size of 16,384 and a cosine learning rate schedule with initial learning rate 10 × 10⁻⁴ and a weight decay of 0.01. The initial learning rate was determined based on small-scale experiments. However, these findings did not generalize well to very long training schedules and dropped the learning rate a few times during training when the loss values became stagnant. After the base pre-training, we increase the image resolution further and continue training the same weights on the annealing dataset. The optimizer is re-initialized via warm-up to learning rate 2 × 10⁻⁵ and again follows a cosine schedule.

Video. For video pre-training, we start from the image pre-trained and annealed weights as described above. We add the video aggregator and cross-attention layers as described in the architecture, initialized randomly. We freeze all the parameters in the model except the video-specific ones (the aggregator and video cross-attention), and train them on the video pre-training data. We use the same training hyperparameters as the image annealing stage, with small differences in the learning rate. We uniformly sample 16 frames from the full video, and represent each frame using four chunks, each of size 448 × 448 pixels. We use an aggregation factor of 16 in the video aggregator, hence obtaining one effective frame, which the text tokens cross-attend to. We use a global batch size of 4,096, a sequence length of 190 tokens, and a learning rate of 10⁻⁴ during training.
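
For reference, the quoted image/video pre-training settings collected into one place (illustrative dicts only; the key names are mine, the values come from the quote):

```python
# Vision pre-training hyperparameters as quoted above, gathered into dicts.
# (Illustrative only; key names are my own, values are from the quoted text.)

image_pretrain = {
    "init": "pre-trained text model + vision encoder (text weights frozen)",
    "data": "6B image-text pairs",
    "tiles": (4, 336, 336),        # up to four 336x336 tiles per image
    "global_batch_size": 16_384,
    "lr_schedule": "cosine",
    "initial_lr": 10e-4,           # as written in the quote (10 × 10⁻⁴)
    "weight_decay": 0.01,
    "annealing_warmup_lr": 2e-5,   # optimizer re-initialized for higher-resolution annealing
}

video_pretrain = {
    "init": "image pre-trained + annealed weights; only aggregator and video cross-attention trained",
    "frames_sampled": 16,
    "chunks_per_frame": (4, 448, 448),  # four 448x448 chunks per frame
    "aggregation_factor": 16,           # 16 frames -> one effective frame
    "global_batch_size": 4_096,
    "sequence_length": 190,
    "lr": 1e-4,
}

print(image_pretrain["global_batch_size"], video_pretrain["global_batch_size"])
```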

This scrap was closed on 2024/11/18.