Notes from a taste of GPT implementation and training with nanoGPT


I'd like to understand ChatGPT-like models well enough to implement and train one myself...

nanoGPT is exactly that!



It runs as-is if you have a standard Python/PyTorch environment set up.

$ conda create -n nanogpt python=3.10
$ conda activate nanogpt
$ git clone https://github.com/karpathy/nanoGPT

There is no requirements.txt, so we pip install the dependencies by hand.

$ python -m pip install torch numpy transformers datasets tiktoken wandb tqdm


$ python data/shakespeare_char/prepare.py

length of dataset in characters: 1,115,394
all the unique characters: 
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
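Under the hood, prepare.py builds a character-level vocabulary from the raw text and encodes every character to an integer id (the real script then writes a 90/10 train/val split to train.bin/val.bin as numpy arrays; this is a minimal sketch of just the encoding step, with a made-up sample string):

```python
# Sketch of char-level tokenization, as done by data/shakespeare_char/prepare.py
text = "First Citizen: Before we proceed any further, hear me speak."

chars = sorted(set(text))                  # unique characters = the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode(text)
print("vocab size:", len(chars))
print("round trip ok:", decode(ids) == text)
```

For the full Shakespeare corpus this yields the vocab size of 65 seen in the log above.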

Let's train!

$ python train.py config/train_shakespeare_char.py

If you don't have a GPU, pass --device=cpu to train on the CPU.
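Any variable in the config file can be overridden on the command line, and for a CPU run you will also want a much smaller model and fewer iterations so it finishes in a reasonable time. A sketch along these lines (the exact values are just a starting point):

```shell
# CPU-only run: shrink the model and iteration count
python train.py config/train_shakespeare_char.py \
  --device=cpu --compile=False --eval_iters=20 --log_interval=1 \
  --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 \
  --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0
```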

iter 4930: loss 0.8147, time 83.83ms, mfu 4.40%
iter 4940: loss 0.8028, time 83.77ms, mfu 4.40%
iter 4950: loss 0.8258, time 82.90ms, mfu 4.41%
iter 4960: loss 0.8346, time 83.06ms, mfu 4.42%
iter 4970: loss 0.7972, time 83.55ms, mfu 4.42%
iter 4980: loss 0.8012, time 83.56ms, mfu 4.43%
iter 4990: loss 0.8412, time 83.07ms, mfu 4.43%
step 5000: train loss 0.6253, val loss 1.6949
iter 5000: loss 0.8252, time 9155.13ms, mfu 3.99%
$ python sample.py --out_dir=out-shakespeare-char

How now, and you first: you rest like the crown,
'Tis some could in the swelling hath a soldier?

Then is Angelo, the time of the Bolingbroke,
I shall not be contented by him.
Come, know your age, rascal!
You will not conclude among by his households,
But not he to drink a humble for 'twas put to us,
And he was not to be but a wife of it.

Happily, here, my lord, my lord, I have a virtue,
And I am slain. As you are you thrue as my soul.


Her charge is there, having through
That strived that his letter did trunk our interious
Embargard than this courtty bids thee down,
Destroy'd this married by me with the prince,
And do I now breathe most doubt of his night
To the royal disloyal drumset here: do not feel
In land-discontented upon her hate,
Bring in his war
To the earth o' the feast, whereof he does said him the
necessessity in it. A prisoner of the habit
To father of his haste. These of face bitings
Would you keep, I did the Cre
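sample.py takes overrides the same way as train.py; for example --start seeds the prompt, and --num_samples / --max_new_tokens control how much is generated (a sketch, values arbitrary):

```shell
# Sample 3 completions of 200 tokens each from a prompt
python sample.py --out_dir=out-shakespeare-char \
  --start="O Romeo, Romeo" --num_samples=3 --max_new_tokens=200
```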


Let's fine-tune!

Fine-tuning is not that different from training from scratch.

This time we prepare a dataset that is not char-based (data/shakespeare/prepare.py tokenizes with the GPT-2 BPE).

$ python data/shakespeare/prepare.py
$ python train.py config/finetune_shakespeare.py

By default it downloads the gpt2-xl (1558M params, 6.4 GB) pretrained model from Hugging Face, so pick a smaller size if needed.

This time I fine-tuned on a Tesla P100 16GB.
The P100 does not support bf16, and the Triton compiler used by PyTorch 2.0 does not support the Tesla P100 either (CUDA capability 6.0; only 7.0 or later is supported).

So in train.py, use float16 and set compile = False.

In config/finetune_shakespeare.py, specify gpt2-medium (1.5 GB).
gpt2-large (3.2 GB) and above ran out of CUDA memory.
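Putting those changes together, the overrides look like this (variable names as they appear in train.py and the config files):

```python
# in config/finetune_shakespeare.py (or as --key=value on the command line)
init_from = 'gpt2-medium'   # default is 'gpt2-xl'; gpt2-large+ OOMs on 16 GB
dtype = 'float16'           # P100 has no bf16 support
compile = False             # Triton requires CUDA capability >= 7.0
```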

| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB            On | 00000000:03:00.0 Off |                    0 |
| N/A   75C    P0              129W / 125W|   9584MiB / 16384MiB |    100%      Default |
|                                         |                      |                  N/A |

Memory consumption was about 9.5 GB.
So (with fp16) it seems you need roughly 8-10x the model's file size in GPU memory.
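A rough sanity check on that number: AdamW keeps fp32 weights, fp32 gradients, and two fp32 moment buffers, i.e. 16 bytes per parameter before activations (a back-of-the-envelope sketch; the parameter count is approximate):

```python
# Static AdamW training memory, ignoring activations and temporary buffers
n_params = 350e6                 # gpt2-medium, approximately

bytes_per_param = 4 + 4 + 4 + 4  # fp32 weights, grads, exp_avg, exp_avg_sq
static_gb = n_params * bytes_per_param / 1024**3
print(f"~{static_gb:.1f} GB before activations")
```

The gap between this and the observed 9.5 GB is activations, temporary buffers, and the CUDA context.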

step 15: train loss 3.2630, val loss 3.1418
iter 15: loss 3.1496, time 45894.53ms, mfu 0.89%
iter 16: loss 3.2725, time 25267.80ms, mfu 0.90%
iter 17: loss 2.8428, time 25303.24ms, mfu 0.91%
iter 18: loss 2.7150, time 25267.56ms, mfu 0.92%
iter 19: loss 2.8012, time 25260.80ms, mfu 0.93%
step 20: train loss 3.1946, val loss 3.1525
iter 20: loss 3.2055, time 47285.73ms, mfu 0.89%
WHAT IS THIS?                                                                                                                         
A GUY                                                              
THIS IS A MESSAGE FROM A BEAUTIFUL GUY                                                                                                
* * *                            
A GUY                                                                                                                                 
This is a message from a lovely lady.                                                                                                 
A GUY                                                                                                                                 
This is a message from a mighty lady.
A GUY                                                                                                                                 
This is a message from a not so very beautiful lady.               

A GUY                                                                                                                                 
This is a message from a very ugly lady.
A GUY                                                                                                                                 
This is a message from a very ugly man.                            
This is a message from a very ugly man.


<|endoftext|> shows up in the output -- is the tokenizer misbehaving? (It is GPT-2's end-of-text token, id 50256, used as a document separator, so sampling it occasionally is probably expected.)



With 8 x A100, GPT-2 (gpt2-xl) training can be done in 4 days...!!!

With fp16, or with PyTorch Lightning (DeepSpeed), maybe it could be trained on 2 x 3090 or 4 x 3090?

Or 2 x 3090 x 4 nodes.
For multi-node training, I recommend the InfiniBand-connected mini cluster that every household has lying around.



An IS5022 switch with 8 QDR ports (40 Gbps, ~32 Gbps effective) can be sourced on eBay for about 15,000 JPY.

Otherwise, the sensible option is to take my time and train gpt2-medium (350M params) on a single Tesla P100 16GB.

Reproducing GPT-2 training is a TODO.