Closed on 2023/04/01

Fine-Tuning GPT

I've been interested in fine-tuning but had never actually done it, so I'm giving it a try.

As usual, in Colaboratory.

Install the packages (some of these end up unused).

!apt install -y jq

!pip install langchain
!pip install openai

!pip install tiktoken

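The dataset. This time I'm using the conversation dataset from the "つくよみちゃん会話AI育成計画" (Tsukuyomi-chan conversational AI training project).

Load it with pandas and go all the way to JSONL in one step.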

import pandas as pd

df = pd.read_excel("https://docs.google.com/spreadsheets/d/e/2PACX-1vS9Jxjsp_H1v-N37xHu6LOZ2vOpedtgSe6VGht_-kL_-PWBEJ3c36ofAm8PMT0KOA/pub?output=xlsx", header=1)
df = df[df.columns[1:3]]
# set_axis(..., inplace=True) was removed in pandas 2.0, so assign the result instead
df = df.set_axis(['prompt', 'completion'], axis=1)
df.to_json("tsukuyomi_dataset.jsonl", orient='records', lines=True)

Take a look at the dataset that was created.

!head tsukuyomi_dataset.jsonl

{"prompt":"\u304a\u8179\u304c\u9cf4\u308b","completion":"\u4f55\u304b\u53ec\u3057\u4e0a\u304c\u308a\u307e\u3059\u304b\uff1f"}
{"prompt":"\u3060\u308b\u3044","completion":"\u305d\u308c\u306f\u5fc3\u914d\u3067\u3059\u3002\u79c1\u306b\u3067\u304d\u308b\u3053\u3068\u304c\u3042\u308c\u3070\u3001\u4f55\u3067\u3082\u304a\u7533\u3057\u4ed8\u3051\u304f\u3060\u3055\u3044\u3002"}
{"prompt":"\u7720\u308c\u306a\u3044","completion":"\u4f55\u3082\u5fc3\u914d\u3059\u308b\u5fc5\u8981\u306f\u3042\u308a\u307e\u305b\u3093\u3002\u79c1\u304c\u3064\u3044\u3066\u3044\u307e\u3059\u3002\u305a\u3063\u3068\u3001\u305a\u3063\u3068\u3001\u4e00\u7dd2\u3067\u3059\u3002"}
{"prompt":"\u62bc\u3059\u306a\u3088\u3001\u7d76\u5bfe\u62bc\u3059\u306a\u3088","completion":"\u306f\u3044\u3001\u7d76\u5bfe\u62bc\u3057\u307e\u305b\u3093\uff01"}
{"prompt":"\u898b\u305b\u3089\u308c\u306a\u3044\u3088","completion":"\u76ee\u3092\u3064\u3076\u3063\u3066\u5f85\u3063\u3066\u3044\u307e\u3059\u3002"}
{"prompt":"\u6065\u3092\u77e5\u308c","completion":"\u7533\u3057\u8a33\u3054\u3056\u3044\u307e\u305b\u3093\u2026\u2026\u3002"}
{"prompt":"\u3061\u3085\u3061\u3085\u3061\u3085\u3061\u3085\u3061\u3085\u3001\u3061\u3085\u30fc","completion":"\u306d\u305a\u307f\u3055\u3093\u3067\u3059\u304b\uff1f"}
{"prompt":"\u3058\u3083\u3093\u3051\u3093\u307d\u3093\u3000\u3042\u3063\u3061\u5411\u3044\u3066\u30db\u30a4","completion":"\u3042\u3063\u2026\u2026\u3002\u307e\u305f\u3064\u3089\u308c\u3066\u3057\u307e\u3044\u307e\u3057\u305f\u2026\u2026\u3002"}
{"prompt":"\u30cf\u30cf\u30cf\u30cf\u30cf\u30cf\uff01\uff01","completion":"\u697d\u3057\u305d\u3046\u3067\u4f55\u3088\u308a\u3067\u3059\uff01"}
{"prompt":"\u4eca\u65e5\u306e\u6669\u5fa1\u98ef\u306f\uff1f","completion":"\u3042\u3063\u3055\u308a\u304c\u3044\u3044\u3067\u3059\u304b\uff1f\u3000\u304c\u3063\u3064\u308a\u304c\u3044\u3044\u3067\u3059\u304b\uff1f"}
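The `\uXXXX` sequences in this output are ordinary JSON escapes, so the file is intact even though it is hard to read. A minimal sketch of decoding one line (the first record above) with the standard library; I believe passing `force_ascii=False` to `to_json` would write readable UTF-8 directly, but that is not what was run here.

```python
import json

# First line of the head output above, kept as a raw string so the
# backslash escapes reach the JSON parser unchanged
line = r'{"prompt":"\u304a\u8179\u304c\u9cf4\u308b","completion":"\u4f55\u304b\u53ec\u3057\u4e0a\u304c\u308a\u307e\u3059\u304b\uff1f"}'

record = json.loads(line)
print(record["prompt"])
print(record["completion"])

# Re-serialize without ASCII escaping to get a human-readable line
print(json.dumps(record, ensure_ascii=False))
```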

Let's estimate the fine-tuning training cost in advance.

import tiktoken
from tiktoken.core import Encoding

all_str = ''.join(df.values.flatten())

encoding: Encoding = tiktoken.encoding_for_model("davinci")

tokens = encoding.encode(all_str)
tokens_count = len(tokens)

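print("Characters: {}".format(len(all_str)))
print("Tokens: {}".format(tokens_count))
print("Training cost (estimate): ${}".format(tokens_count / 1000 * 0.03))

Hmm, this comes out a little higher than I expected. Switching the model passed to tiktoken changes the count quite a bit. The base model for fine-tuning should be "davinci", so I'm not sure what to make of this.

Characters: 18327
Tokens: 25637
Training cost (estimate): $0.76911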
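One likely reason there are more tokens than characters: as far as I know, the encoding tiktoken selects for "davinci" is a byte-level BPE with a mostly English vocabulary, so Japanese text tends to fall back to tokens covering only a byte or two. The sketch below only restates the figures above and shows the UTF-8 size of one prompt; the explanation itself is my assumption and was not verified against this dataset.

```python
# Figures reported in the output above
characters = 18327
tokens = 25637
print(f"tokens per character: {tokens / characters:.2f}")

# UTF-8 size of one dataset prompt. A byte-level BPE tokenizer never needs
# more tokens than bytes, but for byte sequences it rarely saw it can get close.
sample = "だるい"
utf8_bytes = len(sample.encode("utf-8"))
print(f"{sample!r}: {len(sample)} characters, {utf8_bytes} UTF-8 bytes")
```

Encodings with a different vocabulary split the same bytes differently, which would fit the observation that changing the tiktoken model changes the count.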

Set the API key.

OPENAI_API_KEY="Enter your API key here"#@param{type: "string"}

import os
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
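If you would rather not leave the key in a notebook cell, one option is to use an existing environment variable and only prompt when it is missing. This is a sketch, not part of the original steps; `resolve_api_key` is a hypothetical helper name.

```python
import os
from getpass import getpass

def resolve_api_key(environ=os.environ, ask=getpass):
    """Return OPENAI_API_KEY from the environment, prompting only if it is unset."""
    key = environ.get("OPENAI_API_KEY")
    if not key:
        # getpass hides the input, so the key does not end up in the cell output
        key = ask("OpenAI API key: ")
    return key
```

It would be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.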

Validating the dataset in the next step creates a file, but when the step is repeated the existing file doesn't seem to get overwritten, so delete it beforehand.

!rm -f tsukuyomi_dataset_prepared.jsonl

Validate the dataset.

!openai tools fine_tunes.prepare_data -f tsukuyomi_dataset.jsonl -q

Warnings caused by the pandas version show up here and there, but they can be ignored for now.

Analyzing...

- Your file contains 469 prompt-completion pairs
- There are 2 duplicated prompt-completion sets. These are rows: [438, 439]
- More than a third of your `prompt` column/key is uppercase. Uppercase prompts tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- More than a third of your `completion` column/key is uppercase. Uppercase completions tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- Your data does not contain a common ending at the end of your completions. Having a common ending string appended to the end of the completion makes it clearer to the fine-tuned model where the completion should end. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples.
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 2 duplicate rows [Y/n]: Y
- [Recommended] Lowercase all your data in column/key `prompt` [Y/n]: Y
/usr/local/lib/python3.9/dist-packages/openai/validators.py:448: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x[column] = x[column].str.lower()
- [Recommended] Lowercase all your data in column/key `completion` [Y/n]: Y
- [Recommended] Add a suffix separator ` ->` to all prompts [Y/n]: Y
/usr/local/lib/python3.9/dist-packages/openai/validators.py:222: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["prompt"] += suffix
- [Recommended] Add a suffix ending `\n` to all completions [Y/n]: Y
/usr/local/lib/python3.9/dist-packages/openai/validators.py:378: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["completion"] += suffix
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
/usr/local/lib/python3.9/dist-packages/openai/validators.py:421: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x["completion"] = x["completion"].apply(


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `tsukuyomi_dataset_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "tsukuyomi_dataset_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 26.16 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.

A validated dataset file is generated, named after the original file with "_prepared.jsonl" appended. Its contents look like this.

!head tsukuyomi_dataset_prepared.jsonl

{"prompt":"お腹が鳴る ->","completion":" 何か召し上がりますか？\n"}
{"prompt":"だるい ->","completion":" それは心配です。私にできることがあれば、何でもお申し付けください。\n"}
{"prompt":"眠れない ->","completion":" 何も心配する必要はありません。私がついています。ずっと、ずっと、一緒です。\n"}
{"prompt":"押すなよ、絶対押すなよ ->","completion":" はい、絶対押しません！\n"}
{"prompt":"見せられないよ ->","completion":" 目をつぶって待っています。\n"}
{"prompt":"恥を知れ ->","completion":" 申し訳ございません……。\n"}
{"prompt":"ちゅちゅちゅちゅちゅ、ちゅー ->","completion":" ねずみさんですか？\n"}
{"prompt":"じゃんけんぽん　あっち向いてホイ ->","completion":" あっ……。またつられてしまいました……。\n"}
{"prompt":"ハハハハハハ！！ ->","completion":" 楽しそうで何よりです！\n"}
{"prompt":"今日の晩御飯は？ ->","completion":" あっさりがいいですか？　がっつりがいいですか？\n"}
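Comparing the two files, the changes are the ones accepted in the prompts above: duplicates dropped, text lowercased, ` ->` appended to each prompt, and a leading space plus `\n` added to each completion. The sketch below reproduces those edits by hand to make them explicit. It is not the tool's implementation, and `prepare_records` is a hypothetical name.

```python
import json

def prepare_records(records, separator=" ->", ending="\n"):
    """Approximate the edits prepare_data applied in the run above."""
    seen = set()
    prepared = []
    for rec in records:
        key = (rec["prompt"], rec["completion"])
        if key in seen:  # drop exact duplicate prompt-completion pairs
            continue
        seen.add(key)
        prepared.append({
            # lower() leaves Japanese text unchanged
            "prompt": rec["prompt"].lower() + separator,
            "completion": " " + rec["completion"].lower() + ending,
        })
    return prepared

rows = [
    {"prompt": "だるい", "completion": "それは心配です。"},
    {"prompt": "だるい", "completion": "それは心配です。"},
    {"prompt": "恥を知れ", "completion": "申し訳ございません……。"},
]
for rec in prepare_records(rows):
    print(json.dumps(rec, ensure_ascii=False))
```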

Now run the training.

!openai api fine_tunes.create -t tsukuyomi_dataset_prepared.jsonl -m davinci

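The fine-tuning cost is shown in the output too. It's quite a bit higher than my advance estimate ($3.21 versus about $0.77), and I'm not sure why it differs this much.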

Upload progress: 100% 71.4k/71.4k [00:00<00:00, 68.0Mit/s]
Uploaded file from tsukuyomi_dataset_prepared.jsonl: file-WMxskcZ7KJq8mlTErqtnddnV
Created fine-tune: ft-kzsiIYeU8vjc0DO2G06eTLc1
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-04-01 01:57:38] Created fine-tune: ft-kzsiIYeU8vjc0DO2G06eTLc1
[2023-04-01 01:58:18] Fine-tune costs $3.21
[2023-04-01 01:58:18] Fine-tune enqueued. Queue number: 0
[2023-04-01 01:58:20] Fine-tune started
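One plausible explanation for the gap between the $3.21 reported here and the roughly $0.77 estimated earlier: the estimate counted the dataset's tokens once, while training cost scales with the number of epochs. The sketch below assumes billing is proportional to tokens × epochs and takes 4 epochs from the "Completed epoch 4/4" line in the completed log later in this scrap. The small remainder might come from the ` ->`, space, and `\n` tokens added by prepare_data, but I have not verified that.

```python
def estimate_training_cost(token_count, price_per_1k_tokens, n_epochs):
    """Cost under the assumption that every epoch is billed at the same rate."""
    return token_count / 1000 * price_per_1k_tokens * n_epochs

# 25637 tokens and $0.03 per 1K tokens are the figures from the earlier estimate
single_pass = estimate_training_cost(25637, 0.03, 1)
four_epochs = estimate_training_cost(25637, 0.03, 4)
print(f"1 epoch : ${single_pass:.2f}")
print(f"4 epochs: ${four_epochs:.2f}")
```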

While waiting, you may also see output like this.

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-kzsiIYeU8vjc0DO2G06eTLc1

The fine-tune is queued and keeps running, but the command-line stream seems to get disconnected partway through. Running the command exactly as printed shows the current status.

!openai api fine_tunes.follow -i ft-kzsiIYeU8vjc0DO2G06eTLc1

If it finishes like the following, you're done. Going by the log timestamps, this run took roughly 13 minutes from start to success.

[2023-04-01 01:57:38] Created fine-tune: ft-kzsiIYeU8vjc0DO2G06eTLc1
[2023-04-01 01:58:18] Fine-tune costs $3.21
[2023-04-01 01:58:18] Fine-tune enqueued. Queue number: 0
[2023-04-01 01:58:20] Fine-tune started
[2023-04-01 02:03:31] Completed epoch 1/4
[2023-04-01 02:05:51] Completed epoch 2/4
[2023-04-01 02:08:10] Completed epoch 3/4
[2023-04-01 02:10:30] Completed epoch 4/4
[2023-04-01 02:11:16] Uploaded model: davinci:ft-personal-2023-04-01-02-11-16
[2023-04-01 02:11:18] Uploaded result file: file-wuNe4ws9f8XwFYtKAbzoksML
[2023-04-01 02:11:18] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m davinci:ft-personal-2023-04-01-02-11-16 -p <YOUR_PROMPT>
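The elapsed time can be read straight off the log timestamps. A small sketch that parses the bracketed prefix of the "started" and "succeeded" lines above; `parse_ts` is a hypothetical helper and assumes every line begins with `[YYYY-MM-DD HH:MM:SS]`.

```python
from datetime import datetime

def parse_ts(line):
    """Parse the "[YYYY-MM-DD HH:MM:SS]" prefix of a fine-tune log line."""
    return datetime.strptime(line[1:20], "%Y-%m-%d %H:%M:%S")

log = [
    "[2023-04-01 01:58:20] Fine-tune started",
    "[2023-04-01 02:03:31] Completed epoch 1/4",
    "[2023-04-01 02:05:51] Completed epoch 2/4",
    "[2023-04-01 02:08:10] Completed epoch 3/4",
    "[2023-04-01 02:10:30] Completed epoch 4/4",
    "[2023-04-01 02:11:18] Fine-tune succeeded",
]

elapsed = parse_ts(log[-1]) - parse_ts(log[0])
print(f"started -> succeeded: {elapsed}")
```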

Now run the fine-tuned model.

from langchain.llms import OpenAI

llm = OpenAI(model_name="davinci:ft-personal-2023-04-01-02-11-16", n=3, best_of=3, model_kwargs={"stop":'\n'})
result = llm.generate(["だるい->"])
result.generations

[[Generation(text='それは心配です。私にできることがあれば、何でもお申し付けください。', generation_info={'finish_reason': None, 'logprobs': None}),
  Generation(text='それは心配です。私がお力になれることはありませんが、是非ともお声を掛けてください。', generation_info={'finish_reason': None, 'logprobs': None}),
  Generation(text='それでも私は、あなたを愛しています。', generation_info={'finish_reason': 'stop', 'logprobs': None})]]
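The prepared training prompts end with ` ->` (with a leading space), whereas the call above passed `だるい->` without one, and the returned texts differ in whether they keep a leading space. A minimal sketch of two hypothetical helpers, `to_model_prompt` and `clean_completion`, that keep inference input and output consistent with the training format. Whether this changes the quality of the answers was not tested here.

```python
SEPARATOR = " ->"  # suffix separator chosen by prepare_data in this run
STOP = "\n"        # ending appended to every training completion

def to_model_prompt(user_text):
    """Append the training-time separator, normalizing a hand-typed '->'."""
    text = user_text.rstrip()
    if text.endswith("->"):
        text = text[:-2].rstrip()
    return text + SEPARATOR

def clean_completion(raw):
    """Cut at the stop sequence and drop the leading space used in training."""
    return raw.split(STOP, 1)[0].strip()

print(repr(to_model_prompt("だるい")))
print(repr(to_model_prompt("だるい->")))
print(repr(clean_completion(" それは心配です。\n")))
```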

Try asking something that isn't in the dataset.

from langchain.llms import OpenAI

llm = OpenAI(model_name="davinci:ft-personal-2023-04-01-02-11-16", n=3, best_of=3, model_kwargs={"stop":'\n'})
result = llm.generate(["日本の総理大臣は->"])
result.generations

[[Generation(text=' 選挙で選ばれた方ですか？', generation_info={'finish_reason': 'stop', 'logprobs': None}),
  Generation(text=' あなたが望まれるようにお考えになっている方がいらっしゃいますよ。', generation_info={'finish_reason': None, 'logprobs': None}),
  Generation(text=' 日本国憲法に基づいて選ばれた方ですか？', generation_info={'finish_reason': 'stop', 'logprobs': None})]]

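Hmm, so that's how it turns out. Also, although it isn't shown here, some completely off-base answers came back as well.

Impressions ①

I had skimmed the fine-tuning procedure before, but it was easier than I expected. The answers do come out more or less in character, I think.

However, before trying it, my own picture of fine-tuning was that the base model's knowledge would be largely retained, with the fine-tuning data added on top. The documentation says:

Fine-tuning is currently only available for the following base models: davinci, curie, babbage, and ada. These are the original models that do not have any instruction following training (like text-davinci-003 does for example).

I had been picturing the accuracy of something like text-davinci-003, which has been further trained on top of the base model, so the results felt a bit underwhelming in terms of accuracy.

That said, I'm not sure this dataset was really appropriate for what I had in mind, and there is still a lot I don't understand, so more testing seems necessary.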

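Impressions ②

In my head this was something like fine-tuning vs. LangChain's VectorDBQAChain,

but the dataset structure and the approach are different to begin with, so that comparison was meaningless. If I were to compare something against VectorDBQAChain, I think it would be the following rather than fine-tuning.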

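Summary

As the following also says,

Fine-tuning and Prompt Design are not an either-or debate. It is entirely possible to use them in combination. However, there will be cases where you have to choose one, so (somewhat forcibly) I will compare Fine-tuning and Prompt Design. Note that everything here is a general discussion and ultimately depends on the case.

To repeat, it depends on the case, so please treat this only as a reference. It is also true that many things can't be known until you try. Build something and try it as soon as you can!

That's exactly right, and this is quite hard. Still, my (strictly personal) impression is:

Fine-tuning keeps the architecture simple and the work is easy. But accuracy seems to be something you can't know until you try.
Embeddings/Indexes + Prompt makes the architecture somewhat more complex, but it seems you can control the results to some degree, and the implementation doesn't look that hard either.

In the end, whichever you choose, the quantity and quality of your own data matter, and the only option is to actually try it and see whether you get the accuracy and results you expect.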
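To make the "Embeddings/Indexes + Prompt" shape concrete, here is a toy sketch of the retrieve-then-prompt pattern. Everything in it is an assumption for illustration: `embed` is a character-bigram stand-in for a real embedding model, the prompt template is made up, and none of it is how LangChain's VectorDBQAChain is implemented.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: character-bigram counts."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_context_prompt(query, documents):
    """Put the retrieved text into the prompt instead of into the model weights."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "だるい: それは心配です。私にできることがあれば、何でもお申し付けください。",
    "眠れない: 何も心配する必要はありません。私がついています。",
]
print(build_context_prompt("眠れない", docs))
```

The control mentioned above comes from the retrieval step: which text reaches the model is decided by the index, not by training.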

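Addendum

Is the "->" that ends up in the validated dataset just how it's supposed to be? Going by the prepare_data output above, it appears to be the suffix separator the tool adds to mark where the prompt ends and the completion begins, which is why prompts at inference time are also expected to end with ` ->`.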

This scrap was closed on 2023/04/01.