
Trying QLoRA on Google Colab

Published 2023/05/25

What is QLoRA?

QLoRA is a project that quantizes model weights to 4-bit, dramatically reducing the memory needed to run LLMs.
4-bit loading is now available in Hugging Face transformers.
https://huggingface.co/blog/4bit-transformers-bitsandbytes
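As a minimal sketch of the idea (the small model name below is just an illustrative example, and this requires bitsandbytes, installed in the setup section), any supported model can be loaded in 4-bit by passing load_in_4bit=True:

# Minimal 4-bit load; the model here is only a small example for illustration
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    load_in_4bit=True,   # quantize weights to 4-bit on load (needs bitsandbytes)
    device_map="auto",
)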

Links

Colab
GitHub

Preparation

Open Google Colab and switch the runtime to GPU from the menu via "Runtime → Change runtime type".
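To confirm a GPU is actually attached (a quick sanity check; Colab ships with PyTorch preinstalled), you can run:

# Check that a GPU is visible in this runtime
import torch

print(torch.cuda.is_available())        # should print True
print(torch.cuda.get_device_name(0))    # e.g. a T4 on the free tier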

Environment setup

Installation steps:

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
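As an optional sanity check, you can print the installed versions afterwards:

# Confirm the four libraries imported cleanly and print their versions
import bitsandbytes, transformers, peft, accelerate

print(bitsandbytes.__version__, transformers.__version__, peft.__version__, accelerate.__version__)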

Inference

(0) Supported models

[
    'bigbird_pegasus', 'blip_2', 'bloom', 'bridgetower', 'codegen', 'deit', 'esm', 
    'gpt2', 'gpt_bigcode', 'gpt_neo', 'gpt_neox', 'gpt_neox_japanese', 'gptj', 'gptsan_japanese', 
    'lilt', 'llama', 'longformer', 'longt5', 'luke', 'm2m_100', 'mbart', 'mega', 'mt5', 'nllb_moe', 
    'open_llama', 'opt', 'owlvit', 'plbart', 'roberta', 'roberta_prelayernorm', 'rwkv', 'switch_transformers', 
    't5', 'vilt', 'vit', 'vit_hybrid', 'whisper', 'xglm', 'xlm_roberta'
]  

(1) Loading the model
Let's try loading EleutherAI's gpt-neox-20b. Normally there is no way this would fit on a standard Google Colab, so let's see what happens.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # load weights in 4-bit
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization type
    bnb_4bit_compute_dtype=torch.bfloat16    # do the matmuls in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

And it worked. Unbelievable. This is amazing.
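To see roughly how much memory the 4-bit model actually occupies, get_memory_footprint() on the loaded model is handy (the exact number will vary by environment and transformers version):

# Memory footprint of the quantized model, converted from bytes to GB
print(f"{model_4bit.get_memory_footprint() / 1024**3:.1f} GB")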

(2) Inference
Running a 20B model is completely new territory for me.

text = "Hello my name is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
  outputs = model_4bit.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Inference takes quite a while.

output

Hello my name is john and i am a student at the university of phoenix. i am a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national society of black students. i am also a member of the phoenix chapter of the national

Oh, it's pretty fluent. I'm a little moved that inference even worked, but wow, it really repeats itself lol.
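If you want to tame the repetition, one option is to sample instead of greedy decoding; the values below are only illustrative, not tuned:

# Sampling settings that often reduce this kind of looping (values are illustrative)
with torch.no_grad():
    outputs = model_4bit.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,          # sample instead of greedy decoding
        top_p=0.9,
        temperature=0.8,
        repetition_penalty=1.2,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))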

Notes

I also tried a 30B model, but on the free Google Colab tier there didn't seem to be enough disk space to run it.
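For reference, you can check how much disk space the runtime has left before downloading a large checkpoint (a simple check using only the standard library):

# Show free vs. total disk space on the Colab VM
import shutil

total, used, free = shutil.disk_usage("/")
print(f"free: {free / 1024**3:.1f} GB / total: {total / 1024**3:.1f} GB")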

Advanced application

I'll also show how to run finetuning.
(1) Loading the model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

# prepare the quantized model for k-bit (QLoRA) training with PEFT
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=32,                        # LoRA scaling factor
    target_modules=["query_key_value"],   # attention projection module in GPT-NeoX
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

(2) Preparing the dataset

from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
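To get a feel for the data, you can peek at one example; the dataset has a quote column, which is what gets tokenized by the map above:

# Inspect the first training example and part of its tokenized form
print(data["train"][0]["quote"])
print(data["train"][0]["input_ids"][:10])   # first few token ids added by the map above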

(3) Training

import transformers

# needed for the gpt-neox tokenizer, which has no pad token by default
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

# save peft model
peft_name = "outputs_4bit"
trainer.model.save_pretrained(peft_name)
tokenizer.save_pretrained(peft_name)
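The Colab disk is ephemeral, so if you want to keep the adapter beyond the session, one option (a sketch; the archive name is arbitrary) is to zip and download it:

# Zip the saved adapter directory and download it from Colab
import shutil
from google.colab import files

shutil.make_archive("outputs_4bit", "zip", "outputs_4bit")
files.download("outputs_4bit.zip")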

(4) Running inference
Restart the Google Colab runtime once before running this step.

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Prepare the base model
model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

# Apply the trained PEFT (LoRA) adapter
peft_name = "outputs_4bit"
model = PeftModel.from_pretrained(
    model, 
    peft_name, 
    device_map={"":0}
)
model.eval()

text = "Hello my name is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
  outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

output

Hello my name is john and I am a student at the university of the south. I am a member of the university of the south's chapter of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because I am a member of the fraternity of the yearbook. I am a member of the fraternity of the yearbook because

It clearly needs a lot more finetuning (lol).

Closing thoughts

This time I tried out QLoRA. The day has come when you can use 4-bit quantized models straight from Hugging Face transformers without relying on GPTQ or llama.cpp! And you can even run finetuning.
QLoRA looks like it's going to take off!

I plan to keep posting articles on things I try with LLMs, diffusion models, image analysis, and 3D, so stay tuned.
