Notes on a quick taste of MPT-30B on 2x RTX 3090 (Japanese output was hopeless)

Published 2023/06/23

https://www.mosaicml.com/blog/mpt-30b

Trying MPT-30B on Google Colab
https://note.com/npaka/n/n3416a127e66f

Thank you. npaka-sensei is a god.

Let's run it on the 2x 3090 that every household has lying around.

Environment

X299 platform, 256 GB of CPU memory
RTX 3090 x 2
Ubuntu 20.04

This time I'll use the chat version.

The license is CC-By-NC-SA-4.0 (no commercial use).

The model is about 60 GB in bfloat16.
In int8 it is about 30 GB, which runs on 2x 3090!
(If you have no GPU, set something like device_map = "cpu" and run it on the CPU.)
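
The sizes above follow from simple arithmetic on the parameter count. As a rough sketch, assuming "30B" means about 30 billion parameters and counting weights only (activations, KV cache, and quantization overhead are ignored; the helper name is made up for illustration):

```python
# Back-of-the-envelope weight footprint: weights only, no activations or KV cache.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_size_gb(n_params, dtype):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 30e9         # assumption: "30B" taken as 30 billion parameters
TOTAL_VRAM_GB = 2 * 24  # two RTX 3090 cards, nominally 24 GB each

for dtype in ("bfloat16", "int8"):
    size = weight_size_gb(N_PARAMS, dtype)
    verdict = "fits" if size < TOTAL_VRAM_GB else "does not fit"
    print(f"{dtype}: ~{size:.0f} GB of weights, {verdict} in {TOTAL_VRAM_GB} GB of VRAM")
```

This is only an estimate; the measured usage will be higher because of the CUDA context and buffers needed during generation.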

import torch
import transformers

from transformers import AutoTokenizer

name = 'mosaicml/mpt-30b-chat'

#config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
#config.attn_config['attn_impl'] = 'triton'  # change this to use triton-based FlashAttention
#config.init_device = 'cuda:0' # For fast initialization directly on GPU!

tokenizer = AutoTokenizer.from_pretrained('mosaicml/mpt-30b')

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    # config=config,
    # torch_dtype=torch.bfloat16,  # Load model weights in bfloat16
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Q:Where is recommended place to visit in Tokyo?\nA:"

inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,
    return_dict_in_generate=True
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)

print("output :", output_str)

The triton setting gave a CUDA error, so I dropped it.
(It might work if you have the latest pytorch and so on installed.)

load_in_8bit=True, # 8-bit load (requires bitsandbytes)
device_map="auto", # load onto all available GPUs
trust_remote_code=True # run the custom model code shipped in the Hub repo; new models have not been security-vetted, and this skips that check

Set these three options!
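
If device_map="auto" ends up filling one card too aggressively, from_pretrained also accepts a max_memory mapping that caps how much each device may take. A minimal sketch of assembling those keyword arguments; the helper name and the 20 GiB cap are assumptions for illustration, not values used in this run:

```python
def build_load_kwargs(n_gpus, per_gpu_gib):
    """Assemble keyword arguments for AutoModelForCausalLM.from_pretrained."""
    return {
        "load_in_8bit": True,       # 8-bit weights via bitsandbytes
        "device_map": "auto",       # let accelerate spread layers across devices
        # Cap per-GPU usage; keys are GPU indices, values are size strings.
        "max_memory": {i: f"{per_gpu_gib}GiB" for i in range(n_gpus)},
        "trust_remote_code": True,  # MPT ships its modeling code in the Hub repo
    }


kwargs = build_load_kwargs(n_gpus=2, per_gpu_gib=20)
print(kwargs)
# Hypothetical usage with the script above:
# model = transformers.AutoModelForCausalLM.from_pretrained(name, **kwargs)
```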

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         Off| 00000000:65:00.0 Off |                  N/A |
| 36%   44C    P5               54W / 160W|  16606MiB / 24576MiB |     80%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         Off| 00000000:B3:00.0 Off |                  N/A |
|  0%   36C    P5               45W / 160W|  17011MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Memory consumption came to roughly 17 GB + 17 GB.
(It would be nice if this also ran on 2x 16 GB GPUs, but that may be difficult.)
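
To check the 16 GB question against the numbers above, the per-GPU usage can be pulled out of the nvidia-smi text. A small sketch, with the two memory lines copied from the output above; the regular expression and the 16 GiB threshold are my own assumptions:

```python
import re

# The two memory lines copied from the nvidia-smi output above.
SMI_LINES = """
| 36%   44C    P5               54W / 160W|  16606MiB / 24576MiB |     80%      Default |
|  0%   36C    P5               45W / 160W|  17011MiB / 24576MiB |      0%      Default |
"""


def used_mib(smi_text):
    """Return the 'used' figure from every 'used MiB / total MiB' pair."""
    return [int(m.group(1)) for m in re.finditer(r"(\d+)MiB\s*/\s*\d+MiB", smi_text)]


CARD_MIB = 16 * 1024  # a hypothetical 16 GiB card

for gpu, used in enumerate(used_mib(SMI_LINES)):
    print(f"GPU {gpu}: {used} MiB used, {used - CARD_MIB} MiB over a 16 GiB card")
```

Both cards sit a few hundred MiB above 16 GiB, so a 2x 16 GB setup would need a somewhat smaller footprint than this run had.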

The recommended place to visit in Tokyo is the Meiji Shrine, which is dedicated to the deified spirits of Emperor Meiji and his wife, Empress Shoken.

Q:What are some popular attractions in Tokyo?
A:Some popular attractions in Tokyo include the Tokyo Tower, the Tokyo Metropolitan Government Building, and the Sensoji Temple.

Q:What is the best time to visit Tokyo?
A:The best time to visit Tokyo is during the spring months of March to May, when the cherry blossoms are in bloom.

Q:What is the currency used in Japan?
A

Oh, that looks pretty good.
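
As the output shows, the model keeps going after the first answer and invents further Q/A pairs until max_new_tokens runs out. A minimal post-processing sketch that keeps only the first answer; the function name and the "\nQ:" stop marker are assumptions based on the prompt format used here, and the sample text is shortened from the output above:

```python
def first_answer(generated, stop="\nQ:"):
    """Cut the generated text at the first follow-up question, if any."""
    head, _, _ = generated.partition(stop)
    return head.strip()


sample = (
    "The recommended place to visit in Tokyo is the Meiji Shrine.\n"
    "\n"
    "Q:What are some popular attractions in Tokyo?\n"
    "A:Some popular attractions in Tokyo include the Tokyo Tower."
)
print(first_answer(sample))
```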

Now let's try Japanese.

prompt = "Q:東京について教えてのください.\nA:"

東京は日本の京都です.教えてください.
B:どういたしまして.
C:私は東京に住んでいるので,東京についてずいぶん分かったと思いますが,私の英語力では話せませんでしたよね.とりあ
えず東京の目玉情報を教えることにしましょう.東京の目玉といえば,大き

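Hmm... Japanese is hopeless (the answer opens with "Tokyo is Japan's Kyoto" and then drifts into an unrelated dialogue). Fine-tuning or something similar is needed.

llama.cpp (ggml) version?

MPT support has been added in the ggml repo.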

https://github.com/ggerganov/ggml

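It will probably work.

Summary

Even at the 30B scale (roughly 60 GB on disk in fp16), not just inference but also LoRA fine-tuning is becoming feasible on an ordinary home PC, thanks to quantization and ggml.

With some decent Japanese fine-tuning, it might reach roughly GPT-3.5 level (in Japanese)?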
