Trying out Code Llama (24 GB VRAM)

Quantized models are already available, but I want to see how much GPU memory plain, unquantized Code Llama actually uses.

GPU: RTX 4090 (24 GB VRAM)
Python 3.10.11 (pyenv + pyenv-virtualenv)

$ pip install git+https://github.com/huggingface/transformers.git@main accelerate

Test code.

generate.py

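from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import re

MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # load the unquantized weights in fp16
    device_map="auto",          # let accelerate decide where each layer goes
)

def chat():
    # Greeting: "This is Code Llama. Describe the code you want generated. Enter '終了' (quit) to stop."
    print("AI: Code Llamaです。生成したいコードをプロンプトで指示してください。やめたい場合は'終了'と入力してください。")

    while True:
        user_input = input("あなた: ")  # "You: "

        if user_input == "終了":  # "quit"
            print("AI: さようなら！")  # "Goodbye!"
            break

        if not user_input.strip():
            print("AI: 何か指示してください。")  # "Please give me an instruction."
            continue

        # Code Llama Instruct prompt format. <s> is written by hand here,
        # so the tokenizer is told not to add special tokens a second time.
        prompt = f"<s>[INST] {user_input.strip()} [/INST]"

        inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
        # Only input_ids is passed, which is what triggers the
        # attention-mask / pad_token_id warning visible in the log.
        outputs = model.generate(
            inputs["input_ids"],
            do_sample=True,
            temperature=0.2,
            top_p=0.9,
            eos_token_id=tokenizer.eos_token_id,
            max_length=1024,  # counts prompt tokens plus generated tokens
        )

        output = tokenizer.decode(outputs[0].to("cpu"))
        # Strip the echoed prompt, then turn the end-of-sequence token into a newline.
        output = re.sub(r'<s> \[INST\].*?\[/INST\]\s+', '', output)
        output = re.sub(r'</s>', '\n', output)
        print(f"AI: {output}")

if __name__ == "__main__":
    chat()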
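
As a side note, the generation step can be written so that the attention mask and `pad_token_id` are passed explicitly and only the newly generated tokens are decoded, which removes the need for the regex cleanup. The following is a minimal sketch, not the code that was actually run here: `generate_reply` and `max_new_tokens=512` are names and values assumed for illustration, and `model` / `tokenizer` are expected to be the objects loaded in generate.py.

```python
def generate_reply(model, tokenizer, user_input, max_new_tokens=512):
    """Hypothetical helper: return only the model's reply for one instruction."""
    # Same Code Llama Instruct prompt format as in generate.py.
    prompt = f"<s>[INST] {user_input.strip()} [/INST]"

    # The tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    outputs = model.generate(
        **inputs,                              # passes input_ids and attention_mask together
        do_sample=True,
        temperature=0.2,
        top_p=0.9,
        max_new_tokens=max_new_tokens,         # limits generated tokens only (assumed value)
        pad_token_id=tokenizer.eos_token_id,   # set explicitly instead of relying on the fallback
    )

    # Drop the echoed prompt by slicing off the first prompt_len tokens.
    prompt_len = inputs["input_ids"].shape[1]
    new_tokens = outputs[0][prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Inside the chat loop this would be called as `print(f"AI: {generate_reply(model, tokenizer, user_input)}")`. Memory usage should be roughly the same either way, since the weights dominate.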

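Run it. The first run downloads the model, so it takes a while. In the session below, the prompt (in Japanese) asks for Python code that computes the Fibonacci sequence up to 100. The model replies in Japanese with a recursive fibonacci(n) function and suggests calling print(fibonacci(100)).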

$ python generate.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.03s/it]
AI: Code Llamaです。生成したいコードをプロンプトで指示してください。やめたい場合は'終了'と入力してください。
あなた: pythonで100以下のフィボナッチ数列を計算するコードを生成してください
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
AI: Pythonでは、フィボナッチ数列を計算するには、以下のようなコードを使用できます。
```
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```
このコードは、フィボナッチ数列のn番目の値を返す関数です。nが1または2の場合、その値を返します。それ以外の場合は、n-1番目とn-2番目の値を計算し、それらの和を返します。

このコードを使用して、100以下のフィボナッチ数列を計算するには、以下のようにします。
```
print(fibonacci(100))
```
このコードは、100番目のフィボナッチ数列を計算し、その値を表示します。

あなた: 終了
AI: さようなら！

kun432

CodeLlama-7b-Instruct-hf

About 14 to 15 GB.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         On | 00000000:01:00.0 Off |                  Off |
|  0%   60C    P2              239W / 450W|  14837MiB / 24564MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1125      G   /usr/lib/xorg/Xorg                            9MiB |
|    0   N/A  N/A      1358      G   /usr/bin/gnome-shell                         10MiB |
|    0   N/A  N/A   1075390      C   ...yenv/versions/code-llama/bin/python    14812MiB |
+---------------------------------------------------------------------------------------+
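
The numbers above were read off the nvidia-smi table by eye. If you want to log them while the model is generating, nvidia-smi's CSV query mode can be parsed from Python. This is a rough sketch, assuming `nvidia-smi` is on the PATH. The function names `parse_memory_csv` and `query_gpu_memory` are made up for this example.

```python
import subprocess

def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse lines like '14837, 24564' into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used/total memory in MiB (requires an NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling `query_gpu_memory()` in a loop from a second terminal while generate.py is running should give the same figures as the Memory-Usage column in the table.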

CodeLlama-13b-Instruct-hf

Around 20 to 23 GB, but it is painfully slow. I waited and waited and barely got anything back.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         On | 00000000:01:00.0 Off |                  Off |
| 31%   43C    P8               13W / 450W|  21979MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1125      G   /usr/lib/xorg/Xorg                            9MiB |
|    0   N/A  N/A      1358      G   /usr/bin/gnome-shell                         10MiB |
|    0   N/A  N/A   1125676      C   ...yenv/versions/code-llama/bin/python    21954MiB |
+---------------------------------------------------------------------------------------+

I see, so even on a 4090 this is what plain 13b looks like.
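
A back-of-the-envelope estimate makes this less surprising. The sketch below computes the size of the fp16 weights alone, at 2 bytes per parameter. The parameter counts are the nominal sizes in the model names (7e9, 13e9, 34e9), which is an assumption and not the exact counts. The estimate also ignores the KV cache, activations, and the CUDA context. The 24564 MiB figure is the total reported by nvidia-smi above.

```python
GIB = 2**30

# Nominal parameter counts taken from the model names (assumed, not exact).
NOMINAL_PARAMS = {"7b": 7e9, "13b": 13e9, "34b": 34e9}

# Total VRAM reported by nvidia-smi for the RTX 4090, converted from MiB to GiB.
VRAM_GIB = 24564 / 1024

def fp16_weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Size of the weights alone in GiB; fp16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / GIB

def weights_fit(num_params: float, vram_gib: float = VRAM_GIB) -> bool:
    """True if the fp16 weights by themselves are smaller than the available VRAM."""
    return fp16_weight_gib(num_params) < vram_gib

for name, params in NOMINAL_PARAMS.items():
    size = fp16_weight_gib(params)
    verdict = "fits" if weights_fit(params) else "does not fit"
    print(f"{name}: about {size:.1f} GiB of fp16 weights, {verdict} in {VRAM_GIB:.1f} GiB")
```

By this estimate the 7b weights come to about 13 GiB, which lines up with the 14 to 15 GB measured once overhead is added. The 13b weights alone come to slightly more than the card has. With `device_map="auto"`, layers that do not fit on the GPU can be placed on the CPU, which is one likely explanation for the slowness, though I did not verify the actual layer placement.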

CodeLlama-34b-Instruct-hf

Given how 13b went, there is no way this would run, so I skipped it.

Next I'll try llama.cpp (GGUF).

This scrap was closed on 2023/09/02.