openai evals
Let's try this out
This gets it working
Prerequisites
- Install git-lfs
$ brew install git-lfs
$ git-lfs --version
git-lfs/3.5.1 (GitHub; darwin arm64; go 1.22.1)
Setup
$ git clone https://github.com/openai/evals.git
$ cd evals
$ git lfs install
$ git lfs fetch --all
$ git lfs pull
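As a quick sanity check after the pull (a minimal sketch — `git lfs ls-files` would list the actual pulled files, but reading `.gitattributes` shows which patterns the repo routes through LFS without needing the full history):

```shell
# List the file patterns this repo tracks via Git LFS,
# as declared in .gitattributes
grep "filter=lfs" .gitattributes
```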
Setting up a Python environment by hand is too much hassle, so let's try rye
- Install rye
$ curl -sSf https://rye.astral.sh/get | bash
$ source ~/.profile
$ which rye
/Users/yuki.nagae/.rye/shims/rye
$ rye --version
rye 0.39.0
commit: 0.39.0 (bf3ccf818 2024-08-21)
platform: macos (aarch64)
self-python: cpython@3.12.5
symlink support: true
uv enabled: true
Set up rye in the target project (cd evals)
- Pin Python 3.12 for the project
$ rye pin 3.12
pinned 3.12.5 in /Users/yuki.nagae/repos/evals/.python-version
$ python --version
Python 3.12.5
$ cat .python-version # ← `.python-version` is newly created
3.12.5
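As a quick consistency check (a minimal sketch, assuming you are in the project root and the rye shim is on your PATH, so `python3` resolves to the pinned interpreter), the pinned version and the active interpreter should agree:

```shell
# The version recorded in .python-version should match the interpreter
# that the rye shim resolves for this project
test "$(python3 --version | awk '{print $2}')" = "$(cat .python-version)" && echo "versions match"
```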
- Create the virtual environment + install the dependencies
- The virtual environment is built with venv
- requirements.lock and requirements-dev.lock are newly created
$ rye sync
Initializing new virtualenv in /Users/yuki.nagae/repos/evals/.venv
Python version: cpython@3.12.5
Generating production lockfile: /Users/yuki.nagae/repos/evals/requirements.lock
Generating dev lockfile: /Users/yuki.nagae/repos/evals/requirements-dev.lock
Installing dependencies
Resolved 185 packages in 58ms
Built evals @ file:///Users/yuki.nagae/repos/evals
Built fire==0.6.0
Built spacy-universal-sentence-encoder==0.4.6
Built langdetect==1.0.9
Built spacy==3.7.6
Prepared 184 packages in 3m 45s
Installed 185 packages in 2.36s
+ absl-py==2.1.0
(omitted)
Done!
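After `rye sync` finishes, both lockfiles it reported generating should exist next to pyproject.toml (a minimal check):

```shell
# rye sync writes one lockfile for production deps and one for dev deps;
# both are plain pinned-requirements text files
ls requirements.lock requirements-dev.lock
```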
Sanity check with openai evals' dummy data: oaieval gpt-3.5-turbo test-match
$ export OPENAI_API_KEY=your-api-key
$ rye run oaieval gpt-3.5-turbo test-match
[2024-08-30 13:47:35,886] [registry.py:271] Loading registry from /Users/yuki.nagae/repos/evals/evals/registry/evals
[2024-08-30 13:47:36,342] [registry.py:271] Loading registry from /Users/yuki.nagae/.evals/evals
[2024-08-30 13:47:36,344] [oaieval.py:215] Run started: 2408300447367YPEOHXL
[2024-08-30 13:47:41,117] [eval.py:36] Evaluating 3 samples
[2024-08-30 13:47:41,208] [eval.py:144] Running in threaded mode with 10 threads!
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.60it/s]
[2024-08-30 13:47:42,052] [oaieval.py:275] Found 3/3 sampling events with usage data
[2024-08-30 13:47:42,052] [oaieval.py:283] Token usage from 3 sampling events:
completion_tokens: 14
prompt_tokens: 80
total_tokens: 94
[2024-08-30 13:47:42,056] [record.py:371] Final report: {'accuracy': 0.6666666666666666, 'boostrap_std': 0.4680000000000001, 'usage_completion_tokens': 14, 'usage_prompt_tokens': 80, 'usage_total_tokens': 94}. Logged to /tmp/evallogs/2408300447367YPEOHXL_gpt-3.5-turbo_test-match.jsonl
[2024-08-30 13:47:42,056] [oaieval.py:233] Final report:
[2024-08-30 13:47:42,056] [oaieval.py:235] accuracy: 0.6666666666666666
[2024-08-30 13:47:42,056] [oaieval.py:235] boostrap_std: 0.4680000000000001
[2024-08-30 13:47:42,056] [oaieval.py:235] usage_completion_tokens: 14
[2024-08-30 13:47:42,056] [oaieval.py:235] usage_prompt_tokens: 80
[2024-08-30 13:47:42,056] [oaieval.py:235] usage_total_tokens: 94
[2024-08-30 13:47:42,059] [record.py:360] Logged 6 rows of events to /tmp/evallogs/2408300447367YPEOHXL_gpt-3.5-turbo_test-match.jsonl: insert_time=1.681ms
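The run log under /tmp/evallogs is JSONL, one JSON event per line, so it can be inspected with standard tools. A sketch (the sample line here is hypothetical and only stands in for the real log file; the actual event fields are whatever your run logged):

```shell
# Each line of the evallogs file is one JSON event; pretty-print the first.
# Hypothetical sample line standing in for the real log file:
echo '{"final_report": {"accuracy": 0.6666666666666666}}' > /tmp/sample_event.jsonl
head -n 1 /tmp/sample_event.jsonl | python3 -m json.tool
```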
Accuracy is 0.66, so that means 2 out of 3 correct.
The article I followed shows accuracy 1.0, so maybe it uses a different dataset 🤔
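The arithmetic checks out: 2 correct out of 3 samples gives exactly the accuracy in the final report:

```shell
# 2 of 3 samples correct -> the accuracy value in the final report
python3 -c "print(2/3)"
# 0.6666666666666666
```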