Giving PowerInfer a quick try
On an RTX 4090.
Just follow the README and you're set.
$ git clone https://github.com/SJTU-IPADS/PowerInfer && cd PowerInfer
$ cmake -S . -B build -DLLAMA_CUBLAS=ON
$ cmake --build build --config Release
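Once the build finishes, the binary lands in build/bin/. Running it with --help is a quick way to confirm the build worked and to skim the available options:
$ ./build/bin/main --help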
Using LLaMA(ReLU)-2-70B.
$ wget --content-disposition https://huggingface.co/PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF/resolve/main/llama-70b-relu.q4.powerinfer.gguf?download=true
Inference. I don't fully have a handle on the parameters yet.
$ ./build/bin/main -m ./llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
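For what it's worth, the flags follow the usual llama.cpp conventions (PowerInfer is a llama.cpp fork): -m is the model path, -n the number of tokens to generate, -t the CPU thread count, and -p the prompt.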
The result: 5.8 tokens/sec.
Once upon a time, in a land not so far away, there was a small kingdom which had been ruled by the same family for many generations. The king and his wife were growing old but still produced babies. Their youngest child was Prince Charming. One day he met a lovely young woman called Cinderella when she visited the palace to seek employment as a maidservant. She immediately captured his heart and he decided that he would marry her, even though his mother had told him not to fall in love with anybody because she wanted him to marry a princess from another kingdom. The king agreed, but only
llama_print_timings: load time = 1840.35 ms
llama_print_timings: sample time = 18.89 ms / 128 runs ( 0.15 ms per token, 6777.51 tokens per second)
llama_print_timings: prompt eval time = 442.96 ms / 5 tokens ( 88.59 ms per token, 11.29 tokens per second)
llama_print_timings: eval time = 21801.46 ms / 127 runs ( 171.67 ms per token, 5.83 tokens per second)
llama_print_timings: total time = 22286.68 ms
Log end
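The tokens-per-second figure is just eval runs divided by eval seconds, so the 5.83 above checks out:
$ awk 'BEGIN { print 127 / (21801.46 / 1000) }'
5.8253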
VRAM consumption stays at around 11 GB...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 49C P2 67W / 450W| 11011MiB / 24564MiB | 9% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Apparently PowerInfer auto-offloads to VRAM by default. Adding --vram-budget caps VRAM consumption. Conversely, I wondered whether I could get it to use the full 24 GB, and tried a few things, but couldn't.
Note (2023/12/21 17:20): the parameters here didn't match the run above, so I fixed them and re-ran.
$ ./build/bin/main -m ./llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 4
3.6 tokens/sec.
Once upon a time, there was the most wonderful creature on earth. This creature had gorgeous wings and a beautiful face but it didn't matter to her because she fell in love with one of her own kind.
They were both so happy that they forgot everything else and soon realized that they couldn't be without each other. The two lovers decided to get married.
The wedding was absolutely perfect. Their friends and family came from all over the world just to watch them say their vows to one another.
And then it happened. A beautiful baby boy with wings like his parents was born. And so were
llama_print_timings: load time = 250454.13 ms
llama_print_timings: sample time = 20.20 ms / 128 runs ( 0.16 ms per token, 6337.26 tokens per second)
llama_print_timings: prompt eval time = 680.86 ms / 5 tokens ( 136.17 ms per token, 7.34 tokens per second)
llama_print_timings: eval time = 35323.65 ms / 127 runs ( 278.14 ms per token, 3.60 tokens per second)
llama_print_timings: total time = 36050.46 ms
VRAM stays at around 5 GB.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 50C P2 54W / 450W| 5049MiB / 24564MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
I gave Japanese a quick try too, but as expected it was completely hopeless, so I'll skip it.
For comparison, let's run LLaMA-2-70B (llama-2-70b.Q4_K_M.gguf) on llama.cpp.
I can't be sure the options behave exactly the same, but judging from the help output they look identical, so I matched them to the PowerInfer runs. LLaMA-2-70B seems to top out at 83 layers, so I tried several values of -ngl and compared.
$ ./main -m ./llama-2-70b.Q4_K_M.gguf -n 128 -t 8 -p "Once upon a time" -ngl XX
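The runs below were done one at a time, but a small loop like this (just a sketch) would automate the sweep; llama_print_timings goes to stderr, hence the 2>&1:
$ for n in 10 20 46; do
>   ./main -m ./llama-2-70b.Q4_K_M.gguf -n 128 -t 8 -p "Once upon a time" -ngl $n 2>&1 | grep "eval time"
> done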
-ngl 10: roughly the same VRAM usage as PowerInfer with --vram-budget 4.
llama_print_timings: load time = 134298.17 ms
llama_print_timings: sample time = 41.13 ms / 128 runs ( 0.32 ms per token, 3112.08 tokens per second)
llama_print_timings: prompt eval time = 1369.63 ms / 5 tokens ( 273.93 ms per token, 3.65 tokens per second)
llama_print_timings: eval time = 94539.86 ms / 127 runs ( 744.41 ms per token, 1.34 tokens per second)
llama_print_timings: total time = 96091.27 ms
→ 1.3 tokens/sec
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 48C P2 50W / 450W| 6355MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
-ngl 20: roughly the same VRAM usage as PowerInfer at its defaults.
llama_print_timings: load time = 2163.81 ms
llama_print_timings: sample time = 41.67 ms / 128 runs ( 0.33 ms per token, 3071.46 tokens per second)
llama_print_timings: prompt eval time = 1210.37 ms / 5 tokens ( 242.07 ms per token, 4.13 tokens per second)
llama_print_timings: eval time = 83850.63 ms / 127 runs ( 660.24 ms per token, 1.51 tokens per second)
llama_print_timings: total time = 85403.43 ms
→ 1.5 tokens/sec
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 48C P2 49W / 450W| 11217MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
-ngl 46: about as far as the 24 GB of VRAM will stretch; anything higher hit out-of-memory at startup.
llama_print_timings: load time = 2942.85 ms
llama_print_timings: sample time = 41.69 ms / 128 runs ( 0.33 ms per token, 3070.35 tokens per second)
llama_print_timings: prompt eval time = 796.69 ms / 5 tokens ( 159.34 ms per token, 6.28 tokens per second)
llama_print_timings: eval time = 52104.01 ms / 127 runs ( 410.27 ms per token, 2.44 tokens per second)
llama_print_timings: total time = 53046.24 ms
→ 2.4 tokens/sec
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 50C P2 50W / 450W| 23711MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
PowerInfer really is generating faster than llama.cpp while using less VRAM, and being able to cap VRAM usage explicitly is a plus. Note that I haven't looked closely at output quality this time.
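Summarizing the numbers above:
Configuration                  VRAM (nvidia-smi)  Speed
PowerInfer (default)           11011 MiB          5.8 tokens/sec
PowerInfer (--vram-budget 4)    5049 MiB          3.6 tokens/sec
llama.cpp (-ngl 10)             6355 MiB          1.3 tokens/sec
llama.cpp (-ngl 20)            11217 MiB          1.5 tokens/sec
llama.cpp (-ngl 46)            23711 MiB          2.4 tokens/sec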
That said, there seem to be some constraints; see the README for the details.