Notes on Hugging Face Accelerate

Published on 2023/12/17

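Overview

These are my notes from studying Accelerate.

Basics

Wrap the objects of an ordinary training loop as shown below (lines marked with + are the additions):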

+ from accelerate import Accelerator
+ accelerator = Accelerator()

+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

  for batch in training_dataloader:
      optimizer.zero_grad()
      inputs, targets = batch
      inputs = inputs.to(device)
      targets = targets.to(device)
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
+     accelerator.backward(loss)
      optimizer.step()
      scheduler.step()

Then run the script with the accelerate command.

accelerate launch {my_script.py}

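Accelerate adapts an existing training setup to use DeepSpeed (other optimizations can be selected as well) and automatically takes care of parallel processing and mixed-precision training.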

Accelerate is a library built on top of the following:

torch_xla
A library for using TPUs
torch.distributed
The PyTorch module that provides distributed computing features

Parts of Accelerate that seem important

Creating the config

Create the configuration file with the following command.

accelerate config

It asks a series of questions; answer each one.

# accelerate config
In which compute environment are you running?
This machine
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 4
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 192.168.2.174
What is the port you will use to communicate with the main process? 12345
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:yes
Which dynamo backend would you like to use?
inductor
Do you want to customize the defaults sent to torch.compile? [yes/NO]: no
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
What should be your DeepSpeed's ZeRO optimization stage?
2
Where to offload optimizer states?
none
Where to offload parameters?
none
How many gradient accumulation steps you're passing in your script? [1]: 5
Do you want to use gradient clipping? [yes/NO]: no
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: no
Which Type of launcher do you want to use?
pdsh
DeepSpeed configures multi-node compute resources with hostfile. Each row is of the format `hostname slots=[num_gpus]`, e.g., `localhost slots=2`; for more information please refer official [documentation](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). Please specify the location of hostfile: /home/xxx/hostfile
Do you want to specify exclusion filter string? [yes/NO]: no
Do you want to specify inclusion filter string? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:16
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml

The generated configuration file looks like this.

# cat /root/.cache/huggingface/accelerate/default_config.yaml                                                                                                                            
compute_environment: LOCAL_MACHINE                                                                                                                                                                                                              
debug: true
deepspeed_config:
  deepspeed_hostfile: /home/xxx/hostfile
  deepspeed_multinode_launcher: pdsh
  gradient_accumulation_steps: 5
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
machine_rank: 0
main_process_ip: 192.168.2.174
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 4
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
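To sanity-check how these numbers relate, here is a small sketch. It takes num_processes as the total number of processes across all machines, so 16 processes on 4 machines means 4 per machine. The per-device batch size of 8 is a made-up value for illustration, and the function names are my own. The effective-batch formula assumes plain data parallelism where each process receives its own batch.

```python
# Values copied from the default_config.yaml shown above.
config = {
    "num_machines": 4,
    "num_processes": 16,
    "gradient_accumulation_steps": 5,
}


def processes_per_machine(cfg):
    """Number of processes (GPUs) each machine runs, assuming an even split."""
    if cfg["num_processes"] % cfg["num_machines"] != 0:
        raise ValueError("num_processes is not evenly divisible by num_machines")
    return cfg["num_processes"] // cfg["num_machines"]


def effective_batch_size(cfg, per_device_batch_size):
    """Samples contributing to one optimizer update across all processes."""
    return (
        per_device_batch_size
        * cfg["num_processes"]
        * cfg["gradient_accumulation_steps"]
    )


if __name__ == "__main__":
    print("processes per machine:", processes_per_machine(config))
    # per_device_batch_size=8 is a hypothetical value, not taken from the config.
    print("effective batch size:", effective_batch_size(config, per_device_batch_size=8))
```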

About the dataloader

Passing a dataloader to accelerator.prepare makes each GPU or TPU process handle its own share of the dataset.
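The idea can be pictured with a pure-Python sketch that hands out batches round-robin to the processes. This is a simplified illustration of the concept, not Accelerate's actual implementation: the function name shard_batches is hypothetical, and it ignores how an uneven final batch is handled.

```python
def shard_batches(num_samples, batch_size, num_processes, process_index):
    """Return the batches (lists of sample indices) one process would handle."""
    indices = list(range(num_samples))
    batches = [
        indices[start:start + batch_size]
        for start in range(0, num_samples, batch_size)
    ]
    # Process k takes batches k, k + num_processes, k + 2 * num_processes, ...
    return batches[process_index::num_processes]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_batches(num_samples=32, batch_size=4, num_processes=4, process_index=rank))
```

With 32 samples, a batch size of 4, and 4 processes, each process ends up with 2 of the 8 batches, and no sample is seen by two processes.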

Verifying the config file

First create the configuration:

accelerate config

Then check that it works with the following command.

accelerate test

Running training

Run the training script as follows.

accelerate launch path_to_script.py --args_for_the_script

Running from a notebook

To launch training from a notebook, use notebook_launcher as follows.

from accelerate import notebook_launcher

notebook_launcher(training_function)

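Notes on using TPUs

When training on a TPU, the first step runs a one-time initialization and is expensive; from the second step onward, steps are much cheaper.
However, this only holds under the following conditions.

All batches have the same size
The same code runs at every step. For example, the number of for-loop iterations must not change from step to step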
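One common way to meet the first condition is to pad every sequence to a fixed length and drop the incomplete final batch, so every step sees the same shapes. The sketch below shows this in plain Python; the helper names, the pad value of 0, and the sample data are assumptions for illustration.

```python
def pad_to_fixed_length(sequences, max_length, pad_value=0):
    """Truncate or pad each sequence so that all of them have exactly max_length items."""
    padded = []
    for sequence in sequences:
        trimmed = list(sequence[:max_length])
        padded.append(trimmed + [pad_value] * (max_length - len(trimmed)))
    return padded


def fixed_size_batches(samples, batch_size):
    """Split samples into batches of exactly batch_size, dropping the leftover tail."""
    usable = len(samples) - len(samples) % batch_size
    return [samples[start:start + batch_size] for start in range(0, usable, batch_size)]


if __name__ == "__main__":
    print(pad_to_fixed_length([[1, 2, 3], [4], [5, 6, 7, 8, 9]], max_length=4))
    print(fixed_size_batches(list(range(10)), batch_size=4))
```

In a PyTorch DataLoader, dropping the incomplete batch corresponds to drop_last=True.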

There are other points to watch for when training on TPUs as well.

Code that should run once per node

Do it as follows.

if accelerator.is_local_main_process:
    # Is executed once per server

Do the same when showing a progress bar, for example.

from tqdm.auto import tqdm

progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)

Code that should run only once across all nodes

Do it as follows.

if accelerator.is_main_process:
    # Is executed once only
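To make the difference between the two flags concrete, the sketch below enumerates a cluster in plain Python. It models my understanding that is_local_main_process is true for the process with local index 0 on each machine, and is_main_process only for global index 0. The function name and the assumption that global indices are assigned contiguously per machine are mine.

```python
def process_flags(num_machines, processes_per_machine):
    """List the two main-process flags for every process in a simulated cluster."""
    flags = []
    for machine_rank in range(num_machines):
        for local_index in range(processes_per_machine):
            # Assumption: global indices are assigned contiguously per machine.
            process_index = machine_rank * processes_per_machine + local_index
            flags.append({
                "machine_rank": machine_rank,
                "process_index": process_index,
                "is_local_main_process": local_index == 0,
                "is_main_process": process_index == 0,
            })
    return flags


if __name__ == "__main__":
    cluster = process_flags(num_machines=4, processes_per_machine=4)
    print("local main processes:", sum(p["is_local_main_process"] for p in cluster))
    print("main processes:", sum(p["is_main_process"] for p in cluster))
```

For 4 machines with 4 processes each, four processes are local main processes (one per machine) and exactly one is the main process.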

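Synchronizing processes during multi-GPU training

To wait until every process has reached the same point, call the following.
A typical place to do this is right before saving the model.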

accelerator.wait_for_everyone()

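Saving and loading training state

The state of the optimizer, random number generators, learning-rate scheduler, and so on during training can be saved and loaded with the following methods.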

save_state
load_state

Accumulating gradients across all nodes before updating