Trying out QLoRA on Google Colab
What is QLoRA?
QLoRA is a technique that quantizes an LLM's weights to 4 bits, drastically cutting the memory needed to run the model while still allowing finetuning through LoRA adapters.
4-bit loading is now available in Hugging Face transformers.
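In its simplest form you just pass load_in_4bit=True (or a BitsAndBytesConfig, as used later in this article) to from_pretrained. A minimal sketch — the small model name here is only a placeholder for illustration:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"  # example only; any architecture from the supported list below should work
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,   # quantize the weights to 4 bit at load time
    device_map="auto",   # place the layers on the available GPU automatically
)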
Links
Setup
Open Google Colab and, from the menu, choose "Runtime → Change runtime type" and set the hardware accelerator to "GPU".
Environment setup
Installation steps:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
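Before loading anything heavy it's worth confirming that the runtime really got a GPU (a plain sanity check, nothing QLoRA-specific):
!nvidia-smi
import torch
print(torch.cuda.is_available())       # should be True on a GPU runtime
print(torch.cuda.get_device_name(0))   # the free tier usually assigns a T4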
Inference
(0) Models that can be used
The following architectures can currently be loaded in 4 bit:
[
'bigbird_pegasus', 'blip_2', 'bloom', 'bridgetower', 'codegen', 'deit', 'esm',
'gpt2', 'gpt_bigcode', 'gpt_neo', 'gpt_neox', 'gpt_neox_japanese', 'gptj', 'gptsan_japanese',
'lilt', 'llama', 'longformer', 'longt5', 'luke', 'm2m_100', 'mbart', 'mega', 'mt5', 'nllb_moe',
'open_llama', 'opt', 'owlvit', 'plbart', 'roberta', 'roberta_prelayernorm', 'rwkv', 'switch_transformers',
't5', 'vilt', 'vit', 'vit_hybrid', 'whisper', 'xglm', 'xlm_roberta'
]
(1) Loading the model
Let's try loading EleutherAI's gpt-neox-20b. A model this size would never normally load on a standard Google Colab instance, so let's see what happens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the weights to 4 bit at load time
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type introduced by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16   # run the actual matmuls in bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
And... it loaded. Unbelievable. This is amazing.
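To see how small the model actually became in memory, transformers can report the footprint of the loaded weights (the exact number depends on your run):
print(f"{model_4bit.get_memory_footprint() / 1024**3:.1f} GB")  # memory used by the quantized weights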
(2) Inference
Running a 20B model is a first for me.
text = "Hello my name is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model_4bit.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Inference takes quite a while.
output
Hello my name is john and i am a student at the university of phoenix. i am a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national
Oh, that's pretty fluent. I'm a little impressed that inference worked at all, but it sure repeats itself a lot lol
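The repetition is mostly an artifact of greedy decoding. Passing sampling options to generate() usually breaks the loop; the values below are just common defaults, not tuned:
with torch.no_grad():
    outputs = model_4bit.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,           # sample instead of greedy decoding
        temperature=0.8,
        top_p=0.95,               # nucleus sampling
        repetition_penalty=1.2,   # discourage repeating the same tokens
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))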
Note
I also tried to run a 30B model, but on the free tier of Google Colab there didn't seem to be enough disk space for it.
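You can check how much disk the Colab VM has left before starting a download like that:
!df -h /content   # shows remaining disk space on the Colab VM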
Advanced application
I'll also show how to finetune the 4-bit model.
(1) Loading the model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
# apply peft model
from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()           # trade compute for memory during backprop
model = prepare_model_for_kbit_training(model)  # cast norms/lm head to fp32 and enable gradients for k-bit training
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=8,                                  # rank of the LoRA update matrices
    lora_alpha=32,                        # scaling factor for the LoRA updates
    target_modules=["query_key_value"],   # the attention projection module in GPT-NeoX
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
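Incidentally, the model returned by get_peft_model also exposes the same information through a built-in method, so the helper above is optional:
model.print_trainable_parameters()   # provided by peft itself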
(2) Preparing the dataset
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
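It's worth peeking at one example to confirm that the tokenizer added input_ids next to the original fields (the field names in the comment are what this dataset should contain):
print(data["train"][0].keys())             # expect 'quote', 'author', 'tags' plus 'input_ids' and 'attention_mask'
print(data["train"][0]["input_ids"][:10])  # first few token ids of the first quote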
(3) Training
import transformers
# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"   # paged AdamW from QLoRA, avoids OOM spikes during optimizer steps
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
# save peft model
peft_name = "outputs_4bit"
trainer.model.save_pretrained(peft_name)
tokenizer.save_pretrained(peft_name)
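Note that save_pretrained on a peft model only writes the LoRA adapter, not the full 20B weights, so the output directory stays tiny. You can check what was written:
!ls -lh outputs_4bit   # should contain adapter_config.json and the adapter weights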
(4) Running inference
Restart the Google Colab runtime once before running this part.
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# prepare the model
model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
# apply peft model
peft_name = "outputs_4bit"
model = PeftModel.from_pretrained(
    model,
    peft_name,
    device_map={"":0}
)
model.eval()
text = "Hello my name is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
output
Hello my name is john and I am a student at the university of the south. I am a member of the university of the south's chapter of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because
The finetuning clearly needs a lot more work (lol).
Conclusion
This time I tried out QLoRA. The day has come when you can use 4-bit quantized models directly from Hugging Face transformers, without reaching for GPTQ or llama.cpp! And you can even finetune them.
QLoRA looks like it's going to take off!
I plan to keep posting hands-on articles about LLMs, diffusion models, image analysis, and 3D, so stay tuned.