🐧

Trying QLoRA fine-tuning of Swallow MS 7B on Paperspace

Published 2024/03/15

This post summarizes how to run inference with Swallow MS 7B and fine-tune it with QLoRA on Paperspace.

GPU: A100 (80GB)
Python 3.10.10

Note: Paperspace defaults to Python 3.9. Entering "cyberes/gradient-base-py3.10:latest" in the Container field when creating the project gives you 3.10.

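What is Swallow MS 7B?

Swallow MS 7B is a large language model (LLM) from the Mistral family that is specialized for Japanese. It can understand and generate a wide range of Japanese text.

What is QLoRA?

QLoRA is a method for fine-tuning large language models (LLMs). Instead of updating all of the model's parameters, it keeps the base model frozen in 4-bit quantized form and trains only small low-rank adapter matrices (LoRA) added on top, which makes fine-tuning far cheaper in GPU memory.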
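
To make the saving concrete, here is a rough sketch of the arithmetic for a single weight matrix. LoRA freezes a d x k weight and trains two low-rank factors of shapes d x r and r x k, so the trainable count is r * (d + k). The 4096 x 4096 shape is an illustrative assumption, not a value read from the Swallow MS 7B config; r = 64 matches the --lora_r value used later in this post.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the two low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


def full_params(d: int, k: int) -> int:
    """Parameters in the frozen d x k weight matrix."""
    return d * k


# Assumed, illustrative shape for one projection matrix.
d, k, r = 4096, 4096, 64

full = full_params(d, k)
lora = lora_trainable_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter for this one matrix holds only a few percent of the parameters of the matrix it adapts, and the frozen matrix itself is stored in 4 bits.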

Inference

First I tried inference only.
If you just want QLoRA, you can skip this part.

Installing packages

!pip install -U transformers torch
!pip install accelerate

Preparing the tokenizer and model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tokyotech-llm/Swallow-MS-7b-v0.1"
)
model = AutoModelForCausalLM.from_pretrained(
    "tokyotech-llm/Swallow-MS-7b-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)

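The model files are downloaded one after another, so this takes a while.

Running inference

# Prepare the prompt
# ("Tell me the top 3 amusing human behaviors as seen by an AI")
prompt = "AIから見た人類の面白い行動をトップ3を教えて"

# Run inference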
encoding = tokenizer.encode_plus(
    prompt,
    add_special_tokens=False,
    return_tensors="pt",
    return_attention_mask=True
)

input_ids = encoding["input_ids"]

# Get the attention mask
attention_mask = encoding["attention_mask"]

tokens = model.generate(
    input_ids.to(device=model.device),
    attention_mask=attention_mask.to(device=model.device),
    max_new_tokens=128,
    temperature=0.99,
    top_p=0.95,
    do_sample=True,
)
out = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(out)

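Response from Swallow MS 7B

(The model appears to continue the prompt with a list of article-style headlines rather than answer it, which is typical of a base model that has not been instruction-tuned.)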

AIから見た人類の面白い行動をトップ3を教えて
- 人工知能が「あがり症」や「人見知り」を改善する技術を研究開発
- なぜディープラーニングは人間の脳を真似たのか?人工知能の歴史を振り返る
- なぜAIは「人種」を分類できるのか?機械学習を知って「人種」の歴史を知ろう
- 【漫画】人工知能とどう接すれば良いのか?
- 【漫画】人間の仕事を奪うのはロボットではない
- 【漫画】人工知能が人間より「知性」が高い

QLoRA fine-tuning

I used the "gozaru dataset" (ござるデータセット).

Installing packages

https://github.com/artidoro/qlora

%cd notebooks
!git clone https://github.com/artidoro/qlora
%cd qlora
!pip install -U -r requirements.txt

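Logging in to Hugging Face (run in a terminal)

You need a Hugging Face access token beforehand.
Log in to Hugging Face and issue one from Settings > Access Tokens.

On Google Colab, running the following lets you enter the access token in the cell output, but on Paperspace I could not type into it.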

!huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):  <- you should be able to type here...

I couldn't figure out how to do that, so I opened a terminal and entered the token there.
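
Another option that avoids the interactive prompt is to log in from Python with a token taken from an environment variable. This is a minimal sketch, not something tested on Paperspace: `huggingface_hub.login(token=...)` is the library's own function, while the helper names and the choice of the `HF_TOKEN` variable are assumptions made for illustration.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the access token from an environment variable (name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to your Hugging Face access token first.")
    return token


def login_from_env(env_var: str = "HF_TOKEN") -> None:
    """Log in without the interactive prompt."""
    # Imported lazily so read_hf_token() works even without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_hf_token(env_var))


# In a notebook cell, after setting the variable in the environment:
# login_from_env()
```

Keeping the token in an environment variable rather than in the notebook also avoids saving it in the .ipynb file.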

Training

Modify qlora.py following npaka's article.
https://note.com/npaka/n/na7c631175111

I made only the following additional change. Search for "Seq2SeqTrainer" to find the spot.

    data_module = make_data_module(tokenizer=tokenizer, args=args)

    # Drop 'predict_dataset', and also 'use_seedable_sampler' in case that key is present
    filtered_args = {
        k: v for k, v in data_module.items()
        if k not in ['predict_dataset', 'use_seedable_sampler']
    }

    trainer = Seq2SeqTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        **filtered_args,
        #**{k:v for k,v in data_module.items() if k != 'predict_dataset'},
    )

Running training

Errors came up constantly, but I eventually got training to run.

%cd /notebooks/qlora
# Run training
!python qlora.py \
    --model_name tokyotech-llm/Swallow-MS-7b-v0.1 \
    --output_dir "./output/test_peft" \
    --dataset "alpaca" \
    --max_steps 1000 \
    --use_auth \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 200 \
    --save_total_limit 40 \
    --max_new_tokens 32 \
    --dataloader_num_workers 1 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 4 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --eval_steps 187 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0 \
    --load_in_4bit \
    --use_peft \
    --batch_size 16 \
    --gradient_accumulation_steps 2

Training results

***** train metrics *****
  epoch                    =       0.13
  train_loss               =     1.5996
  train_runtime            = 0:12:09.19
  train_samples_per_second =      2.743
  train_steps_per_second   =      1.371
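
As a sanity check, the throughput figures can be reproduced from the command-line settings. The sketch below assumes a single GPU and that the effective accumulation setting is --gradient_accumulation_steps 2; the implied dataset size is only approximate because the reported epoch value is rounded.

```python
# Values copied from the training command and the metrics above.
max_steps = 1000
per_device_train_batch_size = 1
gradient_accumulation_steps = 2    # the value that takes effect in the command
runtime_seconds = 12 * 60 + 9.19   # train_runtime = 0:12:09.19
epoch = 0.13

# Assumption: one GPU, so samples per optimizer step = batch size x accumulation.
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
total_samples = max_steps * samples_per_step

steps_per_second = max_steps / runtime_seconds
samples_per_second = total_samples / runtime_seconds
implied_dataset_size = total_samples / epoch

print(f"steps/s   = {steps_per_second:.3f}")
print(f"samples/s = {samples_per_second:.3f}")
print(f"implied dataset size = {implied_dataset_size:,.0f} examples (rough)")
```

Both rates match the reported metrics, which suggests each optimizer step consumed 2 samples and the run covered about 2,000 examples in total.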

Loading the tokenizer and model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    "tokyotech-llm/Swallow-MS-7b-v0.1"
)
model = AutoModelForCausalLM.from_pretrained(
    "tokyotech-llm/Swallow-MS-7b-v0.1",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map={"":0}
)

Loading the QLoRA adapter

# Load the LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "./output/test_peft/checkpoint-1000/adapter_model/",
    device_map={"":0}
)
model.eval()

Inference with QLoRA

Inference 1

# Prepare the prompt ("What do you think of me?")
chat_template = ""
prompt = "俺のことどう思う？"
chat_template = chat_template + "\nUSER: " + prompt + "\nASSISTANT: "

# Run inference
inputs = tokenizer(chat_template, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        top_k=500,
        top_p=0.95,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=100,
    )
# Decode only the newly generated tokens (everything after the prompt),
# instead of slicing the decoded string by character count
prompt_len = inputs["input_ids"].shape[1]
reply = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
chat_template = chat_template + reply
print(reply)

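Response to inference 1

(The answer is off-topic, but it ends in "でござる。知らんけど。", a speech style presumably picked up from the gozaru dataset.)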

フランクは、映画「スーパーマン リターンズ」「ファンタスティック・フォー」「スパイダーマン」など、大ヒット作に数多く出演しましたでござる。知らんけど。

Inference 2

# Prepare the prompt ("What is なんJ?")
prompt = "### Instruction: なんJ とは？\n\n### Response: "

# Run inference
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Response to inference 2

### Instruction: なんJ とは？

### Response: なんJは、日本のインターネット掲示板でござる。知らんけど。

Where I got stuck

I fired question after question at ChatGPT, Claude, and Gemini and forced my way through.

bitsandbytes hell

RuntimeError: 
        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

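"bitsandbytes" gave me a really hard time!!
Isn't Paperspace a Linux environment!?

Running the following fixed it. (The version specifier is quoted so the shell does not treat ">" as output redirection.)

pip uninstall bitsandbytes
pip install "bitsandbytes>=0.43.0"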

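Mistral error

KeyError: 'mistral'

Presumably this happens because Swallow-MS-7b-v0.1 is a Mistral-family model and the installed transformers did not know that architecture yet.
Upgrading transformers to the latest version fixed it.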

pip install -U transformers

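The 4-bit or 8-bit problem

ValueError: You can't pass `load_in_4bit`or `load_in_8bit` as a kwarg when passing `quantization_config` argument at the same time.

At first I read this as a cry of "which one am I supposed to pick, load_in_4bit or load_in_8bit!", but the message actually says the quantization flags are being passed twice: once as the load_in_4bit / load_in_8bit keyword arguments and once inside quantization_config.
Search qlora.py for "quantization_config" and comment it out with "#" as shown below, and it runs!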

    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        cache_dir=args.cache_dir,
        load_in_4bit=args.bits == 4,
        load_in_8bit=args.bits == 8,
        device_map=device_map,
        max_memory=max_memory,
        #quantization_config=BitsAndBytesConfig(
            #load_in_4bit=args.bits == 4,
            #load_in_8bit=args.bits == 8,
            #llm_int8_threshold=6.0,
            #llm_int8_has_fp16_weight=False,
            #bnb_4bit_compute_dtype=compute_dtype,
            #bnb_4bit_use_double_quant=args.double_quant,
            #bnb_4bit_quant_type=args.quant_type,
        #),
        torch_dtype=(torch.float32 if args.fp16 else (torch.bfloat16 if args.bf16 else torch.float32)),
        trust_remote_code=args.trust_remote_code,
        use_auth_token=args.use_auth_token
    )
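
Commenting out quantization_config also discards the nf4, double-quantization, and compute-dtype settings it carried, so the model is loaded with the library's default 4-bit options. A possible alternative, shown here only as an untested sketch, is to do the opposite: keep quantization_config and drop the two load_in_* keyword arguments. The helper names below are hypothetical; `BitsAndBytesConfig` and its parameters are the real transformers API.

```python
def bnb_config_kwargs(bits: int, double_quant: bool = True, quant_type: str = "nf4") -> dict:
    """Build keyword arguments for transformers.BitsAndBytesConfig."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    return {
        "load_in_4bit": bits == 4,
        "load_in_8bit": bits == 8,
        "bnb_4bit_use_double_quant": double_quant,
        "bnb_4bit_quant_type": quant_type,
    }


def load_quantized(model_name: str, bits: int = 4):
    """Load a model, passing quantization settings only through quantization_config."""
    # Imported lazily so the helper above can be used without the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,
        **bnb_config_kwargs(bits),
    )
    # No load_in_4bit / load_in_8bit keyword here: passing them together with
    # quantization_config is what triggers the ValueError.
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map="auto",
    )
```

This mirrors how the model is loaded for inference earlier in this post, where only quantization_config is passed.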

Something wrong with the dataset

ValueError: Invalid pattern: '**' can only be an entire path component

Upgrading datasets fixed it.

pip install -U datasets

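The accelerate library is not up to date

self.accelerator = Accelerator(
TypeError: Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'

Apparently it wasn't fast enough, so I accelerated it some more (the installed accelerate was too old to know this argument).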

pip install --upgrade accelerate
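
Most of the problems above came down to outdated packages, so it can help to print the installed versions before starting. The following is a small sketch using only the standard library. The bitsandbytes minimum comes from this post; the transformers minimum of 4.34.0 is my understanding of when Mistral support was added and should be treated as an assumption.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.43.0' or '4.38.2.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# bitsandbytes: from this post. transformers: assumed minimum for Mistral support.
MINIMUMS = {"bitsandbytes": "0.43.0", "transformers": "4.34.0"}


def report(minimums: dict = MINIMUMS) -> None:
    """Print each package's installed version and whether it meets the minimum."""
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if is_at_least(installed, minimum) else f"older than {minimum}"
        print(f"{name}: {installed} ({status})")


report()
```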
