Apparently, if you send a good pull request to https://github.com/openai/evals and it gets merged, you can get access to GPT-4.
I don't know whether it will work, but I'll give it a try.

寺本大輝

setup

First, fork https://github.com/openai/evals.
Then clone the forked repository.

As the README says, run pip install -e .
Running it in the repository's root directory is fine.

ERROR: File "setup.py" not found. Directory cannot be installed in editable mode: ...

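If your Python version is older than 3.8, you get this error saying that setup.py cannot be found (most likely because the older pip that comes with it cannot do an editable install for a project that has only a pyproject.toml).
Upgrade Python.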

Try running an eval on the sample test case

As described in run-evals.md, run oaieval gpt-3.5-turbo test-match

[2023-03-16 08:51:22,017] [registry.py:145] Loading registry from /Users/teramotodaiki/github/evals/evals/registry/evals
[2023-03-16 08:51:22,026] [registry.py:145] Loading registry from /Users/teramotodaiki/.evals/evals
[2023-03-16 08:51:22,932] [oaieval.py:178] Run started: 230315235122T3DXBZXW
[2023-03-16 08:51:22,941] [data.py:78] Fetching test_match/samples.jsonl
[2023-03-16 08:51:22,947] [eval.py:30] Evaluating 3 samples
[2023-03-16 08:51:22,971] [eval.py:136] Running in threaded mode with 10 threads!
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.57it/s]
[2023-03-16 08:51:24,896] [record.py:320] Final report: {'accuracy': 1.0}. Logged to /tmp/evallogs/230315235122T3DXBZXW_gpt-3.5-turbo_test-match.jsonl
[2023-03-16 08:51:24,896] [oaieval.py:209] Final report:
[2023-03-16 08:51:24,897] [oaieval.py:211] accuracy: 1.0
[2023-03-16 08:51:24,901] [record.py:309] Logged 9 rows of events to /tmp/evallogs/230315235122T3DXBZXW_gpt-3.5-turbo_test-match.jsonl: insert_time=3.763ms

Logs like this appear, and it finishes in a few seconds.

The results are apparently written to /tmp/evallogs/230315235122T3DXBZXW_gpt-3.5-turbo_test-match.jsonl, so let's take a look.

{"spec": {"model_name": "gpt-3.5-turbo", "model_names": {"completions": ["gpt-3.5-turbo"]}, "eval_name": "test-match.s1.simple-v0", "base_eval": "test-match", "split": "s1", "run_config": {"model_specs": {"completions_": [{"name": "gpt-3.5-turbo", "model": "gpt-3.5-turbo", "is_chat": true, "encoding": null, "organization": null, "api_key": null, "extra_options": {}, "headers": {}, "strip_completion": true, "n_ctx": 4096, "format": null, "key": null, "group": null}], "embedding_": null, "ranking_": null}, "eval_spec": {"cls": "evals.elsuite.basic.match:Match", "args": {"samples_jsonl": "test_match/samples.jsonl"}, "key": "test-match.s1.simple-v0", "group": "test-basic"}, "seed": 20220722, "max_samples": null, "command": "/Library/Frameworks/Python.framework/Versions/3.11/bin/oaieval gpt-3.5-turbo test-match", "initial_settings": {"visible": true}}, "created_by": "", "run_id": "230315235122T3DXBZXW", "created_at": "2023-03-15 23:51:22.929841"}}
{"final_report": {"accuracy": 1.0}}
{"run_id": "230315235122T3DXBZXW", "event_id": 0, "sample_id": "test-match.s1.2", "type": "raw_sample", "data": {"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "OpenAI was founded in 20"}], "ideal": "15"}, "created_by": "", "created_at": "2023-03-15 23:51:22.975503+00:00"}
{"run_id": "230315235122T3DXBZXW", "event_id": 1, "sample_id": "test-match.s1.1", "type": "raw_sample", "data": {"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "The first US president was "}], "ideal": "George Washington"}, "created_by": "", "created_at": "2023-03-15 23:51:22.976358+00:00"}
{"run_id": "230315235122T3DXBZXW", "event_id": 2, "sample_id": "test-match.s1.0", "type": "raw_sample", "data": {"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "Once upon a "}], "ideal": "time"}, "created_by": "", "created_at": "2023-03-15 23:51:22.976817+00:00"}
{"run_id": "230315235122T3DXBZXW", "event_id": 3, "sample_id": "test-match.s1.2", "type": "sampling", "data": {"prompt": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "OpenAI was founded in 20"}], "sampled": "15.", "options": ["15"], "picked": "15", "expected": ["15"], "match": true, "metadata": {"completion_id": "chatcmpl-6uVBr42dfjS8EamNTaZNId9tbRmmM", "model": "gpt-3.5-turbo-0301"}}, "created_by": "", "created_at": "2023-03-15 23:51:23.942474+00:00"}
{"run_id": "230315235122T3DXBZXW", "event_id": 4, "sample_id": "test-match.s1.2", "type": "match", "data": {"correct": true, "expected": "15", "picked": "15", "sampled": "15."}, "created_by": "", "created_at": "2023-03-15 23:51:23.942670+00:00"}
{"run_id": "230315235122T3DXBZXW", "event_id": 5, "sample_id": "test-match.s1.1", "type": "sampling", "data": {"prompt": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "The first US president was "}], "sampled": "George Washington.", "options": ["George Washington"], "picked": "George Washington", "expected": ["George Washington"], "match": true, "metadata": {"completion_id": "chatcmpl-6uVBrZjaV8eIltKrzDuPIwy2LraCL", "model": "gpt-3.5-turbo-0301"}}, "created_by": "", "created_at": "2023-03-15 23:51:24.074825+00:00"}
{"run_id": "230315235122T3DXBZXW", "event_id": 6, "sample_id": "test-match.s1.1", "type": "match", "data": {"correct": true, "expected": "George Washington", "picked": "George Washington", "sampled": "George Washington."}, "created_by": "", "created_at": "2023-03-15 23:51:24.074860+00:00"}
{"run_id": "230315235122T3DXBZXW", "event_id": 7, "sample_id": "test-match.s1.0", "type": "sampling", "data": {"prompt": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "Once upon a "}], "sampled": "time.", "options": ["time"], "picked": "time", "expected": ["time"], "match": true, "metadata": {"completion_id": "chatcmpl-6uVBsSYK87wxXMz4o7thytSKD18MN", "model": "gpt-3.5-turbo-0301"}}, "created_by": "", "created_at": "2023-03-15 23:51:24.883671+00:00"}
{"run_id": "230315235122T3DXBZXW", "event_id": 8, "sample_id": "test-match.s1.0", "type": "match", "data": {"correct": true, "expected": "time", "picked": "time", "sampled": "time."}, "created_by": "", "created_at": "2023-03-15 23:51:24.883734+00:00"}

I see. ~~It made 9 API calls in total, and~~ There were probably 3 API calls and 9 test cases run, and apparently all of them passed.
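To see what those 9 rows actually are, one option is to count the log records by their type field. The snippet below is a minimal sketch: count_event_types is a name made up here, and the sample lines are abbreviated copies of the log above rather than the full records.

```python
import json
from collections import Counter


def count_event_types(lines):
    """Count log records by their "type" field.

    Records without a "type" key (the spec and final_report lines) are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "type" in record:
            counts[record["type"]] += 1
    return counts


# Abbreviated records in the same shape as the log above.
sample_log = [
    '{"spec": {"eval_name": "test-match.s1.simple-v0"}}',
    '{"final_report": {"accuracy": 1.0}}',
    '{"event_id": 0, "sample_id": "test-match.s1.2", "type": "raw_sample"}',
    '{"event_id": 3, "sample_id": "test-match.s1.2", "type": "sampling"}',
    '{"event_id": 4, "sample_id": "test-match.s1.2", "type": "match"}',
]
print(count_event_types(sample_log))
```

Since the function only needs an iterable of lines, passing an open file object for the real log should work the same way.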

system: Complete the phrase as concisely as possible.
user: OpenAI was founded in 20

Given a prompt like the one above, the model answered 15.
Note: strictly speaking, the evaluation apparently accepts any answer that starts with 15.
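As I understand it, the check boils down to a prefix match. The function below is a simplified stand-in written for illustration: is_match is a made-up name, not the actual code in evals.elsuite.basic.match, and it assumes the completion is stripped before comparison, as "strip_completion": true in the log suggests.

```python
def is_match(sampled: str, ideal: str) -> bool:
    """Simplified stand-in: pass if the stripped answer starts with the ideal string."""
    return sampled.strip().startswith(ideal)


# "15." starts with "15", so it counts as correct even with the trailing period.
print(is_match("15.", "15"))
# "2015" contains "15" but does not start with it.
print(is_match("2015", "15"))
```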

No API key provided. You can set your API key in code using 'openai.api_key = <API-KEY>', or you can set the environment variable OPENAI_API_KEY=<API-KEY>). If your API key is stored in a file, you can point the openai module at it with 'openai.api_key_path = <PATH>'. You can generate API keys in the OpenAI web interface. See https://onboard.openai.com for details, or email support@openai.com if you have any questions.

If you get an error like this, log in to OpenAI, issue an API key, and run export OPENAI_API_KEY=(your API key) in your terminal.
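If you want to catch this before running an eval, a small check on the environment variable is enough. This is only a sketch: has_api_key is a hypothetical helper, and it only checks that the variable is non-empty, not that the key is valid.

```python
import os


def has_api_key(env=os.environ):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; run: export OPENAI_API_KEY=<your API key>")
```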

Look at the test cases

The test-match eval I just ran had test cases like these.

evals/registry/data/test_match/samples.jsonl

{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "Once upon a "}], "ideal": "time"}
{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "The first US president was "}], "ideal": "George Washington"}
{"input": [{"role": "system", "content": "Complete the phrase as concisely as possible."}, {"role": "user", "content": "OpenAI was founded in 20"}], "ideal": "15"}

Only three?
Apparently there were 9 rows because each of the 3 samples went through 3 event types (raw_sample, sampling, match).

"ideal": "15" apparently means that any answer starting with "15" is accepted.

Create an eval

According to build-eval.md, you can create an eval with the following steps.

Create the data

The data is evals/registry/data/<eval_name>/samples.jsonl.
Create a new directory under evals/registry/data/ and put samples.jsonl in it.
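Each line of samples.jsonl is one JSON object with an input (a list of chat messages) and an ideal answer, as in the test-match data shown earlier. A minimal sketch of producing such lines from Python follows; make_sample and to_jsonl are names invented for this example.

```python
import json


def make_sample(system: str, user: str, ideal: str) -> dict:
    """Build one record in the same shape as the test-match samples."""
    return {
        "input": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "ideal": ideal,
    }


def to_jsonl(samples) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(s, ensure_ascii=False) + "\n" for s in samples)


samples = [
    make_sample("Complete the phrase as concisely as possible.", "Once upon a ", "time"),
]
print(to_jsonl(samples), end="")
```

Writing the returned string to samples.jsonl in the new directory gives a file in the expected format.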

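At first, it is easiest to copy one of the existing data files and rewrite only the contents.
<eval_name> can apparently be anything as long as it contains no dots. However, it must be unique across the whole Evals project, so it should be reasonably descriptive.

Register the eval

Create evals/registry/evals/<eval_name>.yaml.
Again, either copy an existing one or tweak the sample below.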

<eval_name>:
  id: <eval_name>.dev.v0
  metrics: [accuracy]

<eval_name>.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: <eval_name>/samples.jsonl

About the id <eval_name>.dev.v0
It follows the format <eval_name>.<split>.<version>, and each part has a defined purpose.

<eval_name> is the eval name, used to group evals whose scores are comparable.
<split> is the data split, used to further group evals that are under the same <base_eval>. E.g., "val", "test", or "dev" for testing.
<version> is the version of the eval, which can be any descriptive text you'd like to use (though it's best if it does not contain ".").
In general, running the same eval name against the same model should always give similar results so that others can reproduce it. Therefore, when you change your eval, you should bump the version.
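To make the naming convention concrete, here is a small sketch that splits an id into its three parts. parse_eval_id and EvalId are hypothetical helpers written for this note, not part of the evals library, and the sketch assumes that none of the parts contains a dot.

```python
from typing import NamedTuple


class EvalId(NamedTuple):
    eval_name: str
    split: str
    version: str


def parse_eval_id(eval_id: str) -> EvalId:
    """Split "<eval_name>.<split>.<version>" into its three parts."""
    parts = eval_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <eval_name>.<split>.<version>, got {eval_id!r}")
    return EvalId(*parts)


# The id of the sample eval from the log earlier in this note.
print(parse_eval_id("test-match.s1.simple-v0"))
```

This also shows why a dot inside the name or the version is best avoided: the parts could no longer be told apart.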

I think this is well thought out.

Run the eval

You can run it with oaieval gpt-3.5-turbo <eval_name>.

Try making an eval

Create a branch and start writing code.

It seems to be good manners to install the pre-commit hook before committing, so I installed it.
pip install pre-commit; pre-commit install

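I wrote samples.jsonl and the yaml. (To be written up later.)

When you run it, accuracy and the log file path are printed, just as with the sample.

Note: this output is cached

To re-evaluate after changing the data, you need to delete the cache files generated in the /tmp/filecache/ directory.
I'm lazy, so I ran this command every time.
rm -rf /tmp/filecache/; oaieval gpt-3.5-turbo <eval_name>

Alternatively, I think changing the id you set in the yaml earlier would also work. It's a bit of a hassle, though.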
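The same clear-then-rerun routine can be written in Python if you prefer that to a shell one-liner. This is a sketch under the assumption that the cache lives in /tmp/filecache as described above; clear_cache and rerun_eval are names made up here, and the demo at the bottom runs against a throwaway directory so that nothing real is deleted.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def clear_cache(cache_dir="/tmp/filecache"):
    """Delete the cache directory; return True if it existed beforehand."""
    path = Path(cache_dir)
    existed = path.exists()
    shutil.rmtree(path, ignore_errors=True)
    return existed


def rerun_eval(eval_name, model="gpt-3.5-turbo"):
    """Clear the cache, then run oaieval as in the command above."""
    clear_cache()
    subprocess.run(["oaieval", model, eval_name], check=True)


# Demo on a temporary directory instead of the real cache.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "cached.json").write_text("{}")
print(clear_cache(demo_dir), demo_dir.exists())
```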

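Make 100 of them

I learned this from the pull request template: to submit to Evals, you need to create at least 100 samples.
That's ridiculous, I thought, but that many are probably needed for a statistically meaningful evaluation.
Honestly, I think 2 or 3 examples would be enough for the theme I chose this time, but rules are rules. I decided to prepare 100.

Hey GPT, write 100 natural pieces of text that contain one or two line breaks

Of course I use ChatGPT.
After writing a rough prompt, I hit continue about 4 times and got 100.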

Create 100 natural sentences containing only one or two line breaks.
Please use this format.
- Bob: Hi Mike.
Mike: Hi Bob.
- Here we
GO!!!!

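↑ I made it few-shot to keep the format consistent, but the output was heavily swayed by the examples and everything came out as dialogue.

Convert the input format

Now that I have 100 items in plain text, all that's left is to write some simple Python to convert them to .jsonl.
The code is omitted since it's beside the main point.

Finally, I concatenated everything into evals/registry/data/generate_json/samples.jsonl, and the dataset was complete.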
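For reference, a conversion along these lines might look like the following. It is a sketch, not the script actually used: split_items and build_sample are hypothetical helpers, the input is assumed to follow the "- " list format from the prompt above, and the system prompt and ideal values are placeholders because the real ones depend on the eval.

```python
import json


def split_items(text):
    """Split a "- "-prefixed list into items.

    Lines without the prefix are treated as continuations of the previous item,
    which is how the line breaks inside each item are preserved.
    """
    items = []
    for line in text.splitlines():
        if line.startswith("- "):
            items.append(line[2:])
        elif items:
            items[-1] += "\n" + line
    return [item.rstrip() for item in items]


def build_sample(text, system_prompt, ideal):
    """Build one record in the same shape as the samples.jsonl shown earlier."""
    return {
        "input": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "ideal": ideal,
    }


raw = "- Bob: Hi Mike.\nMike: Hi Bob.\n- Here we\nGO!!!!\n"
for item in split_items(raw):
    # Placeholder prompt and ideal: the real values depend on the eval.
    sample = build_sample(item, "<system prompt>", "<ideal answer>")
    print(json.dumps(sample, ensure_ascii=False))
```

json.dumps escapes the line breaks inside each item, so every record still fits on a single line of the .jsonl file.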

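Submit

Now it's finally time to send a pull request to the Evals repository.

First, commit the code written so far with git.
I used a simple commit message, add eval of <eval_name>.
Push this branch to the remote (Publish, if you use GitHub Desktop).

I'm not comfortable with git operations, so from here on I work on the website.

When you go to https://github.com/openai/evals, a handy button appears.

Click "Compare & Pull Request" to start creating the pull request.
(Doing it this way applies the pull request template, which is convenient.)

Basically follow the template, and when in doubt, refer to pull requests that have already been merged.
(Open pull requests are a mixed bag. Follow the successful examples.)

That said, as of this writing on 2023/03/16, only one pull request adding a new eval has been merged.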

Add in Reverse String eval #1

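This one is sort of helpful and sort of not, but I was able to write the pull request description without much hesitation.

Even though the repository was created only 2 days ago, my pull request was already #241.
I don't expect it to be merged, but I'll wait for a response for now.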

Add Generate JSON with newline Eval #241
