Giving PowerInfer a quick try
On an RTX 4090.
Just follow the README and you're set.
$ git clone https://github.com/SJTU-IPADS/PowerInfer && cd PowerInfer
$ cmake -S . -B build -DLLAMA_CUBLAS=ON
$ cmake --build build --config Release
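Once the build finishes, the binary lands in build/bin/. Running it with --help is a quick way to confirm the build worked and to skim the available options:
$ ./build/bin/main --help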
Using LLaMA(ReLU)-2-70B.
$ wget --content-disposition https://huggingface.co/PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF/resolve/main/llama-70b-relu.q4.powerinfer.gguf?download=true
Inference. I don't fully have a handle on the parameters yet.
$ ./build/bin/main -m ./llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
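For what it's worth, the flags follow the usual llama.cpp conventions (PowerInfer is a llama.cpp fork): -m is the model path, -n the number of tokens to generate, -t the CPU thread count, and -p the prompt.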
The result: 5.8 tokens/sec.
Once upon a time, in a land not so far away, there was a small kingdom which had been ruled by the same family for many generations. The king and his wife were growing old but still produced babies. Their youngest child was Prince Charming. One day he met a lovely young woman called Cinderella when she visited the palace to seek employment as a maidservant. She immediately captured his heart and he decided that he would marry her, even though his mother had told him not to fall in love with anybody because she wanted him to marry a princess from another kingdom. The king agreed, but only
llama_print_timings: load time = 1840.35 ms
llama_print_timings: sample time = 18.89 ms / 128 runs ( 0.15 ms per token, 6777.51 tokens per second)
llama_print_timings: prompt eval time = 442.96 ms / 5 tokens ( 88.59 ms per token, 11.29 tokens per second)
llama_print_timings: eval time = 21801.46 ms / 127 runs ( 171.67 ms per token, 5.83 tokens per second)
llama_print_timings: total time = 22286.68 ms
Log end
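The tokens-per-second figure is just eval runs divided by eval seconds, so the 5.83 above checks out:
$ awk 'BEGIN { print 127 / (21801.46 / 1000) }'
5.8253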
VRAM consumption stays at around 11 GB...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 49C P2 67W / 450W| 11011MiB / 24564MiB | 9% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Apparently PowerInfer auto-offloads to VRAM by default. Adding --vram-budget caps VRAM consumption. Conversely, I wondered whether I could get it to use the full 24 GB, and tried a few things, but couldn't.
Note (2023/12/21 17:20): the parameters here didn't match the run above, so I fixed them and re-ran.
$ ./build/bin/main -m ./llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 4
3.6 tokens/sec.
Once upon a time, there was the most wonderful creature on earth. This creature had gorgeous wings and a beautiful face but it didn't matter to her because she fell in love with one of her own kind.
They were both so happy that they forgot everything else and soon realized that they couldn't be without each other. The two lovers decided to get married.
The wedding was absolutely perfect. Their friends and family came from all over the world just to watch them say their vows to one another.
And then it happened. A beautiful baby boy with wings like his parents was born. And so were
llama_print_timings: load time = 250454.13 ms
llama_print_timings: sample time = 20.20 ms / 128 runs ( 0.16 ms per token, 6337.26 tokens per second)
llama_print_timings: prompt eval time = 680.86 ms / 5 tokens ( 136.17 ms per token, 7.34 tokens per second)
llama_print_timings: eval time = 35323.65 ms / 127 runs ( 278.14 ms per token, 3.60 tokens per second)
llama_print_timings: total time = 36050.46 ms
VRAM stays at around 5 GB.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 50C P2 54W / 450W| 5049MiB / 24564MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
I gave Japanese a quick try too, but as expected it was completely hopeless, so I'll skip it.
For comparison, let's run LLaMA-2-70B (llama-2-70b.Q4_K_M.gguf) on llama.cpp.
I can't be sure the options behave exactly the same, but judging from the help output they look identical, so I matched them to the PowerInfer runs. LLaMA-2-70B seems to top out at 83 layers, so I tried several values of -ngl and compared.
$ ./main -m ./llama-2-70b.Q4_K_M.gguf -n 128 -t 8 -p "Once upon a time" -ngl XX
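The runs below were done one at a time, but a small loop like this (just a sketch) would automate the sweep; llama_print_timings goes to stderr, hence the 2>&1:
$ for n in 10 20 46; do
>   ./main -m ./llama-2-70b.Q4_K_M.gguf -n 128 -t 8 -p "Once upon a time" -ngl $n 2>&1 | grep "eval time"
> done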
-ngl 10: roughly the same VRAM usage as PowerInfer with --vram-budget 4.
llama_print_timings: load time = 134298.17 ms
llama_print_timings: sample time = 41.13 ms / 128 runs ( 0.32 ms per token, 3112.08 tokens per second)
llama_print_timings: prompt eval time = 1369.63 ms / 5 tokens ( 273.93 ms per token, 3.65 tokens per second)
llama_print_timings: eval time = 94539.86 ms / 127 runs ( 744.41 ms per token, 1.34 tokens per second)
llama_print_timings: total time = 96091.27 ms
→ 1.3 tokens/sec
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 48C P2 50W / 450W| 6355MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
-ngl 20: roughly the same VRAM usage as PowerInfer at its defaults.
llama_print_timings: load time = 2163.81 ms
llama_print_timings: sample time = 41.67 ms / 128 runs ( 0.33 ms per token, 3071.46 tokens per second)
llama_print_timings: prompt eval time = 1210.37 ms / 5 tokens ( 242.07 ms per token, 4.13 tokens per second)
llama_print_timings: eval time = 83850.63 ms / 127 runs ( 660.24 ms per token, 1.51 tokens per second)
llama_print_timings: total time = 85403.43 ms
→ 1.5 tokens/sec
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 48C P2 49W / 450W| 11217MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
-ngl 46: about as far as the 24 GB of VRAM will stretch; anything higher hit out-of-memory at startup.
llama_print_timings: load time = 2942.85 ms
llama_print_timings: sample time = 41.69 ms / 128 runs ( 0.33 ms per token, 3070.35 tokens per second)
llama_print_timings: prompt eval time = 796.69 ms / 5 tokens ( 159.34 ms per token, 6.28 tokens per second)
llama_print_timings: eval time = 52104.01 ms / 127 runs ( 410.27 ms per token, 2.44 tokens per second)
llama_print_timings: total time = 53046.24 ms
→ 2.4 tokens/sec
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 50C P2 50W / 450W| 23711MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
PowerInfer really is generating faster than llama.cpp while using less VRAM, and being able to cap VRAM usage explicitly is a plus. Note that I haven't looked closely at output quality this time.
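Summarizing the numbers above:
Configuration                  VRAM (nvidia-smi)  Speed
PowerInfer (default)           11011 MiB          5.8 tokens/sec
PowerInfer (--vram-budget 4)    5049 MiB          3.6 tokens/sec
llama.cpp (-ngl 10)             6355 MiB          1.3 tokens/sec
llama.cpp (-ngl 20)            11217 MiB          1.5 tokens/sec
llama.cpp (-ngl 46)            23711 MiB          2.4 tokens/sec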
That said, there seem to be some constraints; see the README for the details.