Trying ALMA-7B-Ja, a Japanese-to-English / English-to-Japanese machine translation model
Using the following as a reference, I'll try running it locally in a quick-and-dirty way.
Environment
- CPU: Intel Core i9-13900F
- Memory: 96GB (32GB×2 + 16GB×2) PC5-38400 (DDR5-4800) DDR5 SDRAM
- GPU: NVIDIA GeForce RTX 4090 24GB
- OS: Ubuntu 22.04
$ mkdir ALMA-7B-ja && cd ALMA-7B-ja
$ pyenv virtualenv 3.10.13 ALMA-7B-ja
$ pyenv local ALMA-7B-ja
$ git lfs install
$ git clone https://huggingface.co/webbigdata/ALMA-7B-Ja
$ pip install huggingface_hub==0.17.3 \
dataclasses \
transformers==4.34.0 \
accelerate==0.23.0 \
sentencepiece
In my environment, this was also needed:
$ pip install protobuf
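Before moving on, a quick sanity check that the libraries import and the GPU is visible (a minimal sketch of my own; note that torch itself is not in the pip command above, so it is assumed to be installed already, since the scripts below import it):

import torch
import transformers
import accelerate

# Print versions and confirm that CUDA is visible to PyTorch.
print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)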
Script
sample.py
import torch
from transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer
model_dir = "ALMA-7B-Ja"
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")
tokenizer = LlamaTokenizer.from_pretrained(model_dir)
prompt1="Translate this from Japanese to English:\nJapanese: ベンチマークを走らせたらNaNが含まれているとエラーが出た時は絶望的な気分になりました\nEnglish:"
input_ids = tokenizer(prompt1, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
prompt2="Translate this from English to Japanese:\nEnglish: Is it possible to improve the quality of translation with such a small model?\nJapanese:"
input_ids = tokenizer(prompt2, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
Run
$ python sample.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.01s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
['Translate this from Japanese to English:\nJapanese: ベンチマークを走らせたらNaNが含まれているとエラーが出た時は絶望的な気分になりました\nEnglish:When I ran the benchmark and got NaN errors, I felt hopeless']
['Translate this from English to Japanese:\nEnglish: Is it possible to improve the quality of translation with such a small model?\nJapanese:こんな小さなモデルで翻訳の品質を向上させることは可能ですか。']
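As an aside, the prompt format is just "Translate this from <src> to <tgt>:\n<src>: <text>\n<tgt>:", so it is easy to wrap in a small helper. A minimal sketch of my own, reusing the model and tokenizer objects from sample.py (the translate() helper is not part of the repository):

def translate(text, src="Japanese", tgt="English"):
    # Build the ALMA-style translation prompt.
    prompt = f"Translate this from {src} to {tgt}:\n{src}: {text}\n{tgt}:"
    input_ids = tokenizer(prompt, return_tensors="pt", max_length=200, truncation=True).input_ids.cuda()
    with torch.no_grad():
        generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250,
                                       do_sample=True, temperature=0.6, top_p=0.9)
    # Decode only the newly generated tokens, dropping the echoed prompt.
    return tokenizer.decode(generated_ids[0][input_ids.shape[1]:], skip_special_tokens=True)

print(translate("Is it possible to improve the quality of translation with such a small model?",
                src="English", tgt="Japanese"))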
VRAM usage is around 15GB.
Wed Oct 18 06:53:40 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 58C P2 237W / 450W| 14855MiB / 24564MiB | 98% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1122 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1359 G /usr/bin/gnome-shell 10MiB |
| 0 N/A N/A 2939994 C .../.pyenv/versions/ALMA-7B/bin/python 14830MiB |
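Besides watching nvidia-smi, the peak allocation can also be read from inside the script via torch.cuda (a minimal sketch of my own; it only counts PyTorch allocations, so it reads slightly lower than nvidia-smi):

# Add after generation to report the peak VRAM allocated by PyTorch in this process.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated VRAM: {peak_gib:.1f} GiB")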
A GPTQ-quantized model is also provided.
Working in the same directory as before.
Install AutoGPTQ
$ pip install auto-gptq
Clone the quantized model
$ git clone https://huggingface.co/webbigdata/ALMA-7B-Ja-GPTQ-Ja-En
Script
sample-autogptq.py
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
quantized_model_dir = "ALMA-7B-Ja-GPTQ-Ja-En"
model_basename = "gptq_model-4bit-128g"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    use_safetensors=True,
    device="cuda:0")
prompt1="Translate this from Japanese to English:\nJapanese: 量子化するとモデルの性能はどのくらい劣化してしまうのでしょうか?\nEnglish:"
input_ids = tokenizer(prompt1, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
prompt2="Translate this from English to Japanese:\nEnglish: I hope that this quantization model will be useful to people.?\nJapanese:"
input_ids = tokenizer(prompt2, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
Run
$ python sample-autogptq.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
['Translate this from Japanese to English:\nJapanese: 量子化するとモデルの性能はどのくらい劣化してしまうのでしょうか?\nEnglish:How much does the performance of the model degrade when it is quantized?']
['Translate this from English to Japanese:\nEnglish: I hope that this quantization model will be useful to people.?\nJapanese:この量子モデルが人々の役に立つことを望みます。']
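To get a rough feel for how the quantized model compares in speed, generation can be timed with the standard library. A minimal sketch of my own, appended to sample-autogptq.py (a single run, not a proper benchmark):

import time

# Time one generation call and report tokens per second.
start = time.perf_counter()
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=250,
                                   do_sample=True, temperature=0.6, top_p=0.9)
elapsed = time.perf_counter() - start
new_tokens = generated_ids.shape[1] - input_ids.shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")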
VRAM usage is around 5GB.
Wed Oct 18 07:35:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 55C P2 68W / 450W| 5143MiB / 24564MiB | 42% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1122 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1359 G /usr/bin/gnome-shell 10MiB |
+---------------------------------------------------------------------------------------+
Other notes
A batch-translation sample is also provided; I'd like to take separate benchmarks with it at some point.
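The sketch below is not that official sample, just my own minimal illustration of how several prompts could be batched with the fp16 model and tokenizer from sample.py; decoder-only models need a pad token and left padding for batched generation (the example sentences are made up):

# Minimal batched-generation sketch (not the official batch-translation sample).
prompts = [
    "Translate this from Japanese to English:\nJapanese: 今日はいい天気です。\nEnglish:",
    "Translate this from English to Japanese:\nEnglish: The weather is nice today.\nJapanese:",
]
tokenizer.pad_token = tokenizer.eos_token   # LLaMA tokenizers define no pad token by default
tokenizer.padding_side = "left"             # left-pad so each prompt ends right before generation
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
    out = model.generate(**batch, num_beams=5, max_new_tokens=250,
                         do_sample=True, temperature=0.6, top_p=0.9)
print(tokenizer.batch_decode(out, skip_special_tokens=True))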
There is also a GGUF version for use with llama.cpp.
The steps below assume llama.cpp is already built with cuBLAS enabled.
Clone the GGUF repository
$ git lfs install
$ git clone https://huggingface.co/mmnga/webbigdata-ALMA-7B-Ja-gguf
Set the prompts in environment variables
# Japanese text for evaluation
TRANSLATE_TEMPLATE_Ja="Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:"
# English text for evaluation
TRANSLATE_TEMPLATE_En="Translate this from English to Japanese:\nEnglish:The book named A Tale of Two Cities was written by Charles. Mary read the book.\nJapanese:"
Japanese→English translation, trying q4_K_M.
$ ./main -m ./webbigdata-ALMA-7B-Ja-gguf/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_Ja"
For reference when offloading to the GPU (no layers were offloaded in this run):
llm_load_tensors: offloaded 0/35 layers to GPU
Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:Zunda-mon is an active and pretty girl who lives in the Tohoku area. "What is your name?" she said. "I am Zunda Mon," she replied. [end of text]
llama_print_timings: load time = 261.76 ms
llama_print_timings: sample time = 12.29 ms / 39 runs ( 0.32 ms per token, 3173.83 tokens per second)
llama_print_timings: prompt eval time = 620.64 ms / 79 tokens ( 7.86 ms per token, 127.29 tokens per second)
llama_print_timings: eval time = 3587.77 ms / 38 runs ( 94.42 ms per token, 10.59 tokens per second)
llama_print_timings: total time = 4230.17 ms
Log end
English→Japanese translation.
$ ./main -m ./webbigdata-ALMA-7B-Ja-gguf/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_En"
Translate this from English to Japanese:\nEnglish:The book named A Tale of Two Cities was written by Charles. Mary read the book.\nJapanese:「T・ウェントワース物語」という名の本はチャールズによって書かれた。メアリーはその本を読んだ。 [end of text]
llama_print_timings: load time = 257.82 ms
llama_print_timings: sample time = 14.75 ms / 52 runs ( 0.28 ms per token, 3524.95 tokens per second)
llama_print_timings: prompt eval time = 574.74 ms / 37 tokens ( 15.53 ms per token, 64.38 tokens per second)
llama_print_timings: eval time = 4744.71 ms / 51 runs ( 93.03 ms per token, 10.75 tokens per second)
llama_print_timings: total time = 5346.32 ms
Log end
Japanese→English translation with -ngl 35 added for full GPU offload looked like this.
$ time ./main -m ./webbigdata-ALMA-7B-Ja-gguf/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_Ja" -ngl 25
Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:Zundamon is a lively and beautiful girl who lives in Tohoku. "What's your name?" asked Zundamon. "My name is Zundamon." [end of text]
llama_print_timings: load time = 543.20 ms
llama_print_timings: sample time = 10.89 ms / 41 runs ( 0.27 ms per token, 3764.23 tokens per second)
llama_print_timings: prompt eval time = 55.20 ms / 79 tokens ( 0.70 ms per token, 1431.11 tokens per second)
llama_print_timings: eval time = 266.79 ms / 40 runs ( 6.67 ms per token, 149.93 tokens per second)
llama_print_timings: total time = 340.61 ms
Log end
real 0m1.632s
user 0m0.922s
sys 0m0.652s
VRAM usage is around 5GB.
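As an aside, the same GGUF file can also be driven from Python via llama-cpp-python instead of the main binary. A minimal sketch of my own, assuming llama-cpp-python is installed with GPU support and the model path matches the clone above:

from llama_cpp import Llama

# Load the q4_K_M GGUF; n_gpu_layers=35 offloads all layers (use 0 for CPU only).
llm = Llama(model_path="./webbigdata-ALMA-7B-Ja-gguf/webbigdata-ALMA-7B-Ja-q4_K_M.gguf",
            n_gpu_layers=35)
prompt = ("Translate this from English to Japanese:\n"
          "English: Is it possible to improve the quality of translation with such a small model?\n"
          "Japanese:")
out = llm(prompt, max_tokens=128)
print(out["choices"][0]["text"])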
I also tried the GGUF version on an M2 Pro Mac, with q4_K_M.
- Mac mini 2023
- Apple M2 Pro chip with 12-core CPU, 19-core GPU, and 16-core Neural Engine
- 32GB unified memory
Without GPU offload
$ ./main -m ./models/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_Ja" -ngl 0
Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:Zunda-man is a lively and lovely girl living in the Tohoku region. She says, "My name is Zunda-man." [end of text]
llama_print_timings: load time = 194.66 ms
llama_print_timings: sample time = 21.89 ms / 33 runs ( 0.66 ms per token, 1507.81 tokens per second)
llama_print_timings: prompt eval time = 5383.92 ms / 79 tokens ( 68.15 ms per token, 14.67 tokens per second)
llama_print_timings: eval time = 1308.55 ms / 32 runs ( 40.89 ms per token, 24.45 tokens per second)
llama_print_timings: total time = 6721.79 ms
With GPU offload
$ ./main -m ./models/webbigdata-ALMA-7B-Ja-q4_K_M.gguf -n 128 -p "$TRANSLATE_TEMPLATE_Ja"
(snip)
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Pro
ggml_metal_init: picking default device: Apple M2 Pro
(snip)
Translate this from Japanese to English:\nJapanese: ずんだもんは東北に住む活発でかわいい女の子です。あなた「きみの名前は?」ずんだもん「ボクの名前はずんだもんなのだ。」\nEnglish:Zundamon is an active, charming girl from the Tohoku district. You ask her "What's your name?" Zundamon replies, "My name is Zundamon." [end of text]
llama_print_timings: load time = 666.81 ms
llama_print_timings: sample time = 29.53 ms / 45 runs ( 0.66 ms per token, 1524.13 tokens per second)
llama_print_timings: prompt eval time = 347.14 ms / 79 tokens ( 4.39 ms per token, 227.58 tokens per second)
llama_print_timings: eval time = 1265.47 ms / 44 runs ( 28.76 ms per token, 34.77 tokens per second)
llama_print_timings: total time = 1649.88 ms
Without GPU offload the M2 Mac is faster, but with GPU offload the RTX 4090 is, as expected, faster. Also, without GPU offload the prompt eval time was long; the generation (eval) time differs somewhat too, but the prompt evaluation is where most of the gap comes from.
That said, either way an M2 Mac is more than enough to run this locally.
v2 is out.