Trying ALMA-7B-Ja, a Japanese-to-English / English-to-Japanese machine translation model
Using the following as a reference, I'll try running it locally in a quick-and-dirty way.
Environment
- CPU: Intel Core i9-13900F
- Memory: 96GB (32GB×2 + 16GB×2) PC5-38400 (DDR5-4800) DDR5 SDRAM
- GPU: NVIDIA GeForce RTX 4090 24GB
- OS: Ubuntu 22.04
$ mkdir ALMA-7B-ja && cd ALMA-7B-ja
$ pyenv virtualenv 3.10.13 ALMA-7B-ja
$ pyenv local ALMA-7B-ja
$ git lfs install
$ git clone https://huggingface.co/webbigdata/ALMA-7B-Ja
$ pip install huggingface_hub==0.17.3 \
dataclasses \
transformers==4.34.0 \
accelerate==0.23.0 \
sentencepiece
In my environment, this was also needed:
$ pip install protobuf
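Before moving on, a quick sanity check that the libraries import and the GPU is visible (a minimal sketch of my own; note that torch itself is not in the pip command above, so it is assumed to be installed already, since the scripts below import it):

import torch
import transformers
import accelerate

# Print versions and confirm that CUDA is visible to PyTorch.
print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)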
Script
sample.py
import torch
from transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer
model_dir = "ALMA-7B-Ja"
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")
tokenizer = LlamaTokenizer.from_pretrained(model_dir)
prompt1="Translate this from Japanese to English:\nJapanese: ベンチマークを走らせたらNaNが含まれているとエラーが出た時は絶望的な気分になりました\nEnglish:"
input_ids = tokenizer(prompt1, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
prompt2="Translate this from English to Japanese:\nEnglish: Is it possible to improve the quality of translation with such a small model?\nJapanese:"
input_ids = tokenizer(prompt2, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
Run
$ python sample.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.01s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
['Translate this from Japanese to English:\nJapanese: ベンチマークを走らせたらNaNが含まれているとエラーが出た時は絶望的な気分になりました\nEnglish:When I ran the benchmark and got NaN errors, I felt hopeless']
['Translate this from English to Japanese:\nEnglish: Is it possible to improve the quality of translation with such a small model?\nJapanese:こんな小さなモデルで翻訳の品質を向上させることは可能ですか。']
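As an aside, the prompt format is just "Translate this from <src> to <tgt>:\n<src>: <text>\n<tgt>:", so it is easy to wrap in a small helper. A minimal sketch of my own, reusing the model and tokenizer objects from sample.py (the translate() helper is not part of the repository):

def translate(text, src="Japanese", tgt="English"):
    # Build the ALMA-style translation prompt.
    prompt = f"Translate this from {src} to {tgt}:\n{src}: {text}\n{tgt}:"
    input_ids = tokenizer(prompt, return_tensors="pt", max_length=200, truncation=True).input_ids.cuda()
    with torch.no_grad():
        generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250,
                                       do_sample=True, temperature=0.6, top_p=0.9)
    # Decode only the newly generated tokens, dropping the echoed prompt.
    return tokenizer.decode(generated_ids[0][input_ids.shape[1]:], skip_special_tokens=True)

print(translate("Is it possible to improve the quality of translation with such a small model?",
                src="English", tgt="Japanese"))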
VRAM usage is around 15GB.
Wed Oct 18 06:53:40 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 58C P2 237W / 450W| 14855MiB / 24564MiB | 98% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1122 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1359 G /usr/bin/gnome-shell 10MiB |
| 0 N/A N/A 2939994 C .../.pyenv/versions/ALMA-7B/bin/python 14830MiB |
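Besides watching nvidia-smi, the peak allocation can also be read from inside the script via torch.cuda (a minimal sketch of my own; it only counts PyTorch allocations, so it reads slightly lower than nvidia-smi):

# Add after generation to report the peak VRAM allocated by PyTorch in this process.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated VRAM: {peak_gib:.1f} GiB")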
A GPTQ-quantized model is also provided.
Working in the same directory as before.
Install AutoGPTQ
$ pip install auto-gptq
Clone the quantized model
$ git clone https://huggingface.co/webbigdata/ALMA-7B-Ja-GPTQ-Ja-En
Script
sample-autogptq.py
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
quantized_model_dir = "ALMA-7B-Ja-GPTQ-Ja-En"
model_basename = "gptq_model-4bit-128g"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0")
prompt1="Translate this from Japanese to English:\nJapanese: 量子化するとモデルの性能はどのくらい劣化してしまうのでしょうか?\nEnglish:"
input_ids = tokenizer(prompt1, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
prompt2="Translate this from English to Japanese:\nEnglish: I hope that this quantization model will be useful to people.?\nJapanese:"
input_ids = tokenizer(prompt2, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
Run
$ python sample-autogptq.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
['Translate this from Japanese to English:\nJapanese: 量子化するとモデルの性能はどのくらい劣化してしまうのでしょうか?\nEnglish:How much does the performance of the model degrade when it is quantized?']
['Translate this from English to Japanese:\nEnglish: I hope that this quantization model will be useful to people.?\nJapanese:この量子モデルが人々の役に立つことを望みます。']
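To get a rough feel for how the quantized model compares in speed, generation can be timed with the standard library. A minimal sketch of my own, appended to sample-autogptq.py (a single run, not a proper benchmark):

import time

# Time one generation call and report tokens per second.
start = time.perf_counter()
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250,
                                   do_sample=True, temperature=0.6, top_p=0.9)
elapsed = time.perf_counter() - start
new_tokens = generated_ids.shape[1] - input_ids.shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")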
VRAM usage is around 5GB.
Wed Oct 18 07:35:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 55C P2 68W / 450W| 5143MiB / 24564MiB | 42% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1122 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1359 G /usr/bin/gnome-shell 10MiB |
+---------------------------------------------------------------------------------------+
Other notes
A batch-translation sample is also provided; I'd like to take separate benchmarks with it at some point.
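The sketch below is not that official sample, just my own minimal illustration of how several prompts could be batched with the fp16 model and tokenizer from sample.py; decoder-only models need a pad token and left padding for batched generation (the example sentences are made up):

# Minimal batched-generation sketch (not the official batch-translation sample).
prompts = [
    "Translate this from Japanese to English:\nJapanese: 今日はいい天気です。\nEnglish:",
    "Translate this from English to Japanese:\nEnglish: The weather is nice today.\nJapanese:",
]
tokenizer.pad_token = tokenizer.eos_token   # LLaMA tokenizers define no pad token by default
tokenizer.padding_side = "left"             # left-pad so each prompt ends right before generation
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
    out = model.generate(**batch, num_beams=5, max_new_tokens=250,
                         do_sample=True, temperature=0.6, top_p=0.9)
print(tokenizer.batch_decode(out, skip_special_tokens=True))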
There is also a GGUF version for use with llama.cpp.
The steps below assume llama.cpp is already built with cuBLAS enabled.
Clone the GGUF repository
$ git lfs install
$ git clone https://huggingface.co/mmnga/webbigdata-ALMA-7B-Ja-gguf
Set the prompts in environment variables
# Japanese text for evaluation
TRANSLATE_TEMPLATE_Ja="Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:"
# English text for evaluation
TRANSLATE_TEMPLATE_En="Translate this from English to Japanese:\nEnglish:The book named A Tale of Two Cities was written by Charles. Mary read the book.\nJapanese:"
Japanese→English translation, trying q4_K_M.
$ ./main -m ./webbigdata-ALMA-7B-Ja-gguf/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_Ja"
For reference when offloading to the GPU (no layers were offloaded in this run):
llm_load_tensors: offloaded 0/35 layers to GPU
Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:Zunda-mon is an active and pretty girl who lives in the Tohoku area. "What is your name?" she said. "I am Zunda Mon," she replied. [end of text]
llama_print_timings: load time = 261.76 ms
llama_print_timings: sample time = 12.29 ms / 39 runs ( 0.32 ms per token, 3173.83 tokens per second)
llama_print_timings: prompt eval time = 620.64 ms / 79 tokens ( 7.86 ms per token, 127.29 tokens per second)
llama_print_timings: eval time = 3587.77 ms / 38 runs ( 94.42 ms per token, 10.59 tokens per second)
llama_print_timings: total time = 4230.17 ms
Log end
English→Japanese translation.
$ ./main -m ./webbigdata-ALMA-7B-Ja-gguf/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_En"
Translate this from English to Japanese:\nEnglish:The book named A Tale of Two Cities was written by Charles. Mary read the book.\nJapanese:「T・ウェントワース物語」という名の本はチャールズによって書かれた。メアリーはその本を読んだ。 [end of text]
llama_print_timings: load time = 257.82 ms
llama_print_timings: sample time = 14.75 ms / 52 runs ( 0.28 ms per token, 3524.95 tokens per second)
llama_print_timings: prompt eval time = 574.74 ms / 37 tokens ( 15.53 ms per token, 64.38 tokens per second)
llama_print_timings: eval time = 4744.71 ms / 51 runs ( 93.03 ms per token, 10.75 tokens per second)
llama_print_timings: total time = 5346.32 ms
Log end
Japanese→English translation with -ngl 35 added for full GPU offload looked like this.
$ time ./main -m ./webbigdata-ALMA-7B-Ja-gguf/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_Ja" -ngl 25
Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:Zundamon is a lively and beautiful girl who lives in Tohoku. "What's your name?" asked Zundamon. "My name is Zundamon." [end of text]
llama_print_timings: load time = 543.20 ms
llama_print_timings: sample time = 10.89 ms / 41 runs ( 0.27 ms per token, 3764.23 tokens per second)
llama_print_timings: prompt eval time = 55.20 ms / 79 tokens ( 0.70 ms per token, 1431.11 tokens per second)
llama_print_timings: eval time = 266.79 ms / 40 runs ( 6.67 ms per token, 149.93 tokens per second)
llama_print_timings: total time = 340.61 ms
Log end
real 0m1.632s
user 0m0.922s
sys 0m0.652s
VRAM usage is around 5GB.
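As an aside, the same GGUF file can also be driven from Python via llama-cpp-python instead of the main binary. A minimal sketch of my own, assuming llama-cpp-python is installed with GPU support and the model path matches the clone above:

from llama_cpp import Llama

# Load the q4_K_M GGUF; n_gpu_layers=35 offloads all layers (use 0 for CPU only).
llm = Llama(model_path="./webbigdata-ALMA-7B-Ja-gguf/webbigdata-ALMA-7B-Ja-q4_K_M.gguf",
            n_gpu_layers=35)
prompt = ("Translate this from English to Japanese:\n"
          "English: Is it possible to improve the quality of translation with such a small model?\n"
          "Japanese:")
out = llm(prompt, max_tokens=128)
print(out["choices"][0]["text"])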
I also tried the GGUF version on an M2 Pro Mac, with q4_K_M.
- Mac mini 2023
- Apple M2 Pro chip with 12-core CPU, 19-core GPU, and 16-core Neural Engine
- 32GB unified memory
Without GPU offload
$ ./main -m ./models/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_Ja" -ngl 0
Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:Zunda-man is a lively and lovely girl living in the Tohoku region. She says, "My name is Zunda-man." [end of text]
llama_print_timings: load time = 194.66 ms
llama_print_timings: sample time = 21.89 ms / 33 runs ( 0.66 ms per token, 1507.81 tokens per second)
llama_print_timings: prompt eval time = 5383.92 ms / 79 tokens ( 68.15 ms per token, 14.67 tokens per second)
llama_print_timings: eval time = 1308.55 ms / 32 runs ( 40.89 ms per token, 24.45 tokens per second)
llama_print_timings: total time = 6721.79 ms
With GPU offload
$ ./main -m ./models/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_Ja"
(snip)
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Pro
ggml_metal_init: picking default device: Apple M2 Pro
(snip)
Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:Zundamon is an active, charming girl from the Tohoku district. You ask her "What's your name?" Zundamon replies, "My name is Zundamon." [end of text]
llama_print_timings: load time = 666.81 ms
llama_print_timings: sample time = 29.53 ms / 45 runs ( 0.66 ms per token, 1524.13 tokens per second)
llama_print_timings: prompt eval time = 347.14 ms / 79 tokens ( 4.39 ms per token, 227.58 tokens per second)
llama_print_timings: eval time = 1265.47 ms / 44 runs ( 28.76 ms per token, 34.77 tokens per second)
llama_print_timings: total time = 1649.88 ms
Without GPU offload the M2 Mac is faster, but with GPU offload the RTX 4090 is, as expected, faster. Also, without GPU offload the prompt eval time was long; the generation (eval) time differs somewhat too, but the prompt evaluation is where most of the gap comes from.
That said, either way an M2 Mac is more than enough to run this locally.
v2 is out.