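Trial and Verification of VR Development Using Generative AI

Published 2025/02/28

Introduction
This article is supplementary material for 『ペンと鍵の部屋VR』 (The Pen and Key Room VR), a work I exhibited as a 16th-term student of the VR Expert Course at 『第16回VRフェス ～VRが創るミライ★ARが救うセカイ』 (virtual reality festival vol.16), hosted by VRプロフェッショナルアカデミー on Saturday, March 1, 2025.
Note: this work received the VRアカデミー賞・審査員特別賞 (VR Academy Award, Judges' Special Prize)!
In this project I used generative AI to implement a mechanism that analyzes what the user draws in VR space and generates it as an item.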

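Overview
To choose a model for this work, I tested how accurately various generative AI models recognize drawings. This article describes what I tested and the results.
[Request]

I use generative AI, but I have not formally studied AI or data analysis, so this is an amateur's verification.

If you have advice along the lines of "this would make it better," I would greatly appreciate it.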

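Premises
I assumed the following.
- Target shape: judge a "key" (other shapes to be added in the future)
- Input data: Unity's LineRenderer; color, line width, and coordinate data (color is not emphasized in this test)
- AI training data: none; rely on the AI's general knowledge
- Judgment targets: a "master image" drawn by generative AI and "hand-drawn data" drawn by people
- Evaluation metrics: success rate (whether the judgment is correct) + processing time (token cost is also considered)

Generative AI models tested (12 models from 4 services)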


| Service | Model | API name |
|---|---|---|
| OpenAI | GPT-4o | chatgpt-4o-latest |
| OpenAI | GPT-4o Mini | gpt-4o-mini |
| OpenAI | GPT-4 Turbo | gpt-4-turbo |
| OpenAI | GPT-3.5 Turbo | gpt-3.5-turbo-0125 |
| Google AI | Gemini 2.0 Flash | gemini-2.0-flash-exp |
| Google AI | Gemini 1.5 Flash | gemini-1.5-flash |
| Google AI | Gemini 1.5 Pro | gemini-1.5-pro |
| Anthropic | Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 |
| Anthropic | Claude 3.5 Haiku | claude-3-5-haiku-20241022 |
| Anthropic | Claude 3 Opus | claude-3-opus-20240229 |
| Mistral AI | Mistral Large | mistral-large-latest |
| Mistral AI | Ministral 8B | ministral-8b-latest |


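Note: about OpenAI o1 and o3-mini

At the time of testing, the API for these models was unavailable to me even after raising my usage tier to Tier 3 or higher, so they are untested.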

https://community.openai.com/t/usage-tier-3-access-denied-to-o3-mini/1109392

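Test procedure and steps
I carried out the verification in the following steps.
STEP0: Collecting test data
STEP1: Deciding the prompt policy
STEP2: Testing with human data (raw data)
STEP3: Converting coordinate data into features (reducing token count)
STEP4: Accuracy test of the feature data
STEP5: Prompt improvement
STEP6: Increasing the number of runs to see trends and decide on a model
STEP7: Using multiple models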

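STEP0: Collecting test data
I wanted drawings from a variety of people, so I asked the academy's instructors and students for help and collected 22 test data items (thank you!).

I set no particular conditions and simply asked them to "please draw a key."
Examples include:
- A simple one
- A colored one
- A lock rather than a key
- Something clearly different

For reference, the master data was drawn by generative AI, on the condition that it be expressed as LineRenderer position data.

I provisionally scored the 23 test data items (the 22 hand-drawn items plus the master), purely by my own subjective judgment.
- Cases I want judged OK (70 points or more): 7
- Cases I want judged NG (30 points or less): 5, of which 3 are ones I absolutely want judged NG
- Cases where opinions would likely differ: 11
I proceeded with the verification using these test data.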

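STEP1: Deciding the prompt policy
First, I decided what kind of prompt to use.

Purpose
- Check the effect of the language of the request sent to the AI (Japanese or English)
- Verify whether the response language (Japanese or English) affects accuracy
- Measure differences in processing speed
- Use only the master data (the reference image) to isolate the effect of language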

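Test cases

| Case | Request language | Response language | Purpose |
|---|---|---|---|
| Case 1 | Japanese | Japanese | Check accuracy and speed when communicating in Japanese only |
| Case 2 | English | English | Check accuracy when communicating only in English, the language the AI handles best |
| Case 3 | English | Japanese | Check whether the response can be in Japanese while maintaining accuracy |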

Prompts for each case
Case 1: Japanese → Japanese (the prompt is shown in the original Japanese, as tested)
    あなたは3D VRアプリケーションの形状判定AIです。
    ユーザーが空間上に描いた線が、次の形状のいずれかに該当するか判定してください:
    key: 鍵, カギ, かぎ, キー

    以下のJSON形式で厳密に回答してください（回答は必ず日本語）:
    ```json
    {
        ""shape_id"": ""識別された形状ID"",
        ""score"": 0-100,
        ""reason"": ""必ず簡潔に1文で判断理由を説明してください。""
    }
    ```

    注意:
    - 必ず1つだけ `shape_id` を選択してください: [key] から選んでください。
    - `score` は形状がどれだけ似ているかの確信度を示します。（高いほど確信度が高い）
    - `score` は 0-100 の範囲で設定 してください。（それ以外の値を返さない）
    - 形状が全く異なる場合は `score` を 20%以下 に設定してください。
    - 判断理由は 必ず日本語で1文で簡潔に述べてください。
    - JSON形式を厳密に守ってください。
    - shape_id の値には 必ず提供された形状IDを使用してください。
        例:
        - NG -> `""shape_id"": ""四角形""`
        - OK -> `""shape_id"": ""square""`
Case 2: English → English
    You are an AI for shape recognition in a 3D VR application.
    Identify which of the following shapes the user has drawn in space:
    key: key

    Respond strictly in the following JSON format (Reply in English only):
    ```json
    {
        ""shape_id"": ""Detected shape ID"",
        ""score"": 0-100,
        ""reason"": ""Must be a single concise sentence in English.""
    }
    ```

    Notes:
    - Choose only one `shape_id` from the provided list: [key].
    - Score represents confidence level (higher = more confident).
    - Score must be between 0-100 (Do not exceed this range).
    - If the shape is completely different, assign a score below 20%.
    - Keep the reason concise (one sentence).
    - Ensure response is fully in English only.
    - Strictly follow the exact JSON format.
    - Do not create new shape IDs.
    - The value of ""shape_id"" must be one of the provided shape IDs (Do not return other values).
        Example:
        - NG -> `""shape_id"": ""rectangle""`
        - OK -> `""shape_id"": ""square""`
Case 3: English → Japanese
    You are an AI for shape recognition in a 3D VR application.
    Identify which of the following shapes the user has drawn in space:
    key: key

    Respond strictly in the following JSON format (Reply in Japanese only):
    ```json
    {
        ""shape_id"": ""識別された形状ID"",
        ""score"": 0-100,
        ""reason"": ""必ず日本語で簡潔に1文で判断理由を説明してください。""
    }
    ```

    Notes:
    - Choose only one `shape_id` from the provided list: [key].
    - Score represents confidence level (higher = more confident).
    - Score must be between 0-100 (Do not exceed this range).
    - If the shape is completely different, assign a score below 20%.
    - Keep the reason concise (one sentence).
    - Ensure response is fully in Japanese only.
    - Strictly follow the exact JSON format.
    - Do not create new shape IDs.
    - The value of ""shape_id"" must be one of the provided shape IDs (Do not return other values).
        Example:
        - NG -> `""shape_id"": ""rectangle""`
        - OK -> `""shape_id"": ""square""`
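Because every prompt asks for a strict JSON reply, the receiving side has to pull the JSON out of the reply and check it. The following is a minimal sketch of that step, not the code used in the article: `parse_judgement` and the sample reply are hypothetical, and it assumes the model may or may not wrap the JSON in a code fence.

```python
import json
import re
from typing import Any, Dict, List


def parse_judgement(reply: str, allowed_ids: List[str]) -> Dict[str, Any]:
    """Extract and validate the JSON object in a model reply (hypothetical helper)."""
    # Take the text from the first "{" to the last "}" so that an optional
    # code fence or surrounding prose is ignored.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))

    # shape_id must be one of the IDs listed in the prompt.
    if data.get("shape_id") not in allowed_ids:
        raise ValueError(f"unexpected shape_id: {data.get('shape_id')!r}")

    # score must be a number in the 0-100 range requested by the prompt.
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score is not a number: {score!r}")
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score!r}")
    return data


# Made-up reply for illustration; the fence is built from a variable only to
# keep this example readable.
fence = "`" * 3
sample_reply = (
    fence + "json\n"
    + '{"shape_id": "key", "score": 85, "reason": "円形の頭部と長い軸がある。"}'
    + "\n" + fence
)
result = parse_judgement(sample_reply, ["key"])
print(result["shape_id"], result["score"])
```

Rejecting replies that fail these checks, rather than silently accepting them, keeps malformed responses out of the score averages.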

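Test results

| Case | Average score | Average processing time (s) | Verdict |
|---|---|---|---|
| Case 1 (Ja→Ja) | 79.7 | 2.11 | ✕ |
| Case 2 (En→En) | 89.8 | 1.83 | ○ |
| Case 3 (En→Ja) | 88.8 | 2.03 | △ |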


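Summary
- Requests involving Japanese took longer in this test (2.11 s for Ja→Ja versus 1.83 s for En→En on average); my guess is that some internal translation is involved.
- The nuances of Japanese make it more likely that the intent becomes ambiguous or that mistranslation occurs.
- English tends to be interpreted more stably and to yield higher accuracy.
→ English-only would normally be the better choice, but I want to receive Japanese responses, so I proceed with Case 3 (English request → Japanese response).
Note: not just for this use case, using English by default seems to be the better choice.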

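STEP2: Testing with human data (raw data)
Purpose
- Send the human-drawn data to the AI with the prompt chosen in STEP1 and observe the AI's judgment tendencies
- Analyze which features make data easy or hard for the AI to identify

What was done
- Run 22 data items × 5 runs each = 110 tests in total
- Compute score statistics (mean, max, min, spread)
- Analyze the characteristics of data with low scores or NG judgments
- Check the influence of colored data (color is not used this time, but I confirm it has no effect)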
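The score statistics mentioned above (mean, max, min, spread) can be computed with Python's standard `statistics` module. This is a minimal sketch under assumptions: the function name `summarize_scores`, the data IDs, and the scores are all made up for illustration, and population standard deviation is used as one possible measure of spread.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize_scores(scores: List[float]) -> Dict[str, float]:
    """Summarize the scores of repeated runs on one data item."""
    return {
        "mean": mean(scores),
        "max": max(scores),
        "min": min(scores),
        # Population standard deviation as a simple measure of spread.
        "stdev": pstdev(scores),
    }


# Hypothetical scores from 5 runs on two data items.
runs = {
    "data_01": [80, 85, 90, 85, 80],
    "data_02": [20, 60, 40, 10, 70],
}
summary = {data_id: summarize_scores(scores) for data_id, scores in runs.items()}
for data_id, stats in summary.items():
    print(data_id, stats)
```

A large spread on a single item is a sign that the model's judgment of that drawing is unstable, which is worth looking at separately from the mean.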

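Problem
When I ran the test, token errors occurred!!

The requests exceeded each service's token limit (which, in hindsight, is no surprise...).
The test reads the drawing data from JSON files; when I measured the token counts, they were as follows.

| Data | File size | Request tokens | Response tokens |
|---|---|---|---|
| Master | 2 KB | 765 | 65 |
| Smallest data | 136 KB | 34948 | 61 |

Almost all of the human-drawn data failed; only the smallest item (the clearly different circle drawing mentioned above) barely fit.

For reference, the master's raw data is below.
Master JSON data
[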
    {
        "color": {
            "a": 1.0,
            "b": 0.0,
            "g": 1.0,
            "r": 1.0
        },
        "width": 0.009999999776482582,
        "positions": [
            {
                "x": 0.0,
                "y": 0.4000000059604645,
                "z": 0.0
            },
            {
                "x": 0.10000000149011612,
                "y": 0.3499999940395355,
                "z": 0.0
            },
            {
                "x": 0.15000000596046448,
                "y": 0.25,
                "z": 0.0
            },
            {
                "x": 0.10000000149011612,
                "y": 0.15000000596046448,
                "z": 0.0
            },
            {
                "x": 0.0,
                "y": 0.10000000149011612,
                "z": 0.0
            },
            {
                "x": -0.10000000149011612,
                "y": 0.15000000596046448,
                "z": 0.0
            },
            {
                "x": -0.15000000596046448,
                "y": 0.25,
                "z": 0.0
            },
            {
                "x": -0.10000000149011612,
                "y": 0.3499999940395355,
                "z": 0.0
            },
            {
                "x": 0.0,
                "y": 0.4000000059604645,
                "z": 0.0
            },
            {
                "x": 0.0,
                "y": 0.10000000149011612,
                "z": 0.0
            },
            {
                "x": 0.0,
                "y": -0.30000001192092896,
                "z": 0.0
            },
            {
                "x": 0.10000000149011612,
                "y": -0.30000001192092896,
                "z": 0.0
            },
            {
                "x": 0.10000000149011612,
                "y": -0.20000000298023224,
                "z": 0.0
            },
            {
                "x": 0.0,
                "y": -0.20000000298023224,
                "z": 0.0
            }
        ]
    }
]

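Summary
Human-drawn LineRenderer data contains a huge number of positions, so sending the raw data is impractical. I therefore moved straight on to feature extraction in STEP3.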

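STEP3: Converting coordinate data into features (reducing token count)
Purpose
- Reduce the token count and improve processing speed
- Remove unnecessary data while maintaining accuracy
- Optimize the input to the AI while preserving the shape's characteristics

Test content
- Use the master data (the ideal reference shape) + the human-drawn data (22 items)
- Apply three feature versions to the 23 data items in total
- Compare JSON size (original vs. after feature extraction)
- Select the most suitable feature version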

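Feature versions created

| Version | Content | File size |
|---|---|---|
| Basic features | Minimal information | Small |
| Direction vectors added | Basic features + direction information (vectors, curvature, stroke classification) | Medium |
| Clustering | Direction vectors + clustering (analysis of direction patterns) | Large |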


Feature JSON data (using the master data)
Version 1: Basic features
{
  "strokes": [
    {
      "points_count": 13,
      "bounding_box": {
        "width": 0.30000001192092896,
        "height": 0.7000000178813934,
        "depth": 0.0
      },
      "start_point": {
        "x": 0.0,
        "y": 0.4000000059604645,
        "z": 0.0
      },
      "end_point": {
        "x": 0.0,
        "y": -0.20000000298023224,
        "z": 0.0
      },
      "total_length": 1.8944272274662421,
      "is_closed": false,
      "simplified_points": [
        {
          "x": 0.0,
          "y": 0.4000000059604645,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": 0.3499999940395355,
          "z": 0.0
        },
        {
          "x": 0.15000000596046448,
          "y": 0.25,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": 0.15000000596046448,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": 0.10000000149011612,
          "z": 0.0
        },
        {
          "x": -0.10000000149011612,
          "y": 0.15000000596046448,
          "z": 0.0
        },
        {
          "x": -0.15000000596046448,
          "y": 0.25,
          "z": 0.0
        },
        {
          "x": -0.10000000149011612,
          "y": 0.3499999940395355,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": 0.4000000059604645,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": -0.30000001192092896,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": -0.30000001192092896,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": -0.20000000298023224,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": -0.20000000298023224,
          "z": 0.0
        }
      ]
    }
  ],
  "global_features": {
    "total_strokes": 1,
    "total_points": 13,
    "aspect_ratio": 0.42857143465353503,
    "centroid": {
      "x": 0.015384615613864018,
      "y": 0.1076923064314402,
      "z": 0.0
    }
  }
}
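To make the version 1 fields concrete, here is a minimal sketch of how such basic features could be computed from a list of 3D points. It is not the article's implementation: the function name `basic_features` is hypothetical, `is_closed` is assumed to mean "start and end points lie within a small tolerance of each other", and the centroid is assumed to be the mean of the simplified points. The input is the 13 simplified points of the master data, with coordinates rounded.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def basic_features(points: List[Point], close_tol: float = 1e-3) -> Dict[str, object]:
    """Compute version-1 style features for one stroke (illustrative sketch)."""
    xs, ys, zs = zip(*points)
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    depth = max(zs) - min(zs)
    # Sum of the lengths of consecutive segments.
    total_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    n = len(points)
    return {
        "points_count": n,
        "bounding_box": {"width": width, "height": height, "depth": depth},
        "start_point": points[0],
        "end_point": points[-1],
        "total_length": total_length,
        # Assumption: "closed" means start and end nearly coincide.
        "is_closed": math.dist(points[0], points[-1]) <= close_tol,
        # Aspect ratio as width / height.
        "aspect_ratio": width / height if height > 0 else 0.0,
        # Assumption: centroid is the mean of the points.
        "centroid": tuple(sum(c) / n for c in (xs, ys, zs)),
    }


# Simplified points of the master data (rounded).
master_points: List[Point] = [
    (0.0, 0.4, 0.0), (0.1, 0.35, 0.0), (0.15, 0.25, 0.0), (0.1, 0.15, 0.0),
    (0.0, 0.1, 0.0), (-0.1, 0.15, 0.0), (-0.15, 0.25, 0.0), (-0.1, 0.35, 0.0),
    (0.0, 0.4, 0.0), (0.0, -0.3, 0.0), (0.1, -0.3, 0.0), (0.1, -0.2, 0.0),
    (0.0, -0.2, 0.0),
]
features = basic_features(master_points)
print(features["total_length"], features["aspect_ratio"], features["centroid"])
```

Under these assumptions the computed length, aspect ratio, and centroid agree with the values in the version 1 JSON up to rounding of the input coordinates.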
Version 2: Direction vectors added
{
  "strokes": [
    {
      "points_count": 13,
      "bounding_box": {
        "width": 0.30000001192092896,
        "height": 0.7000000178813934,
        "depth": 0.0
      },
      "start_point": {
        "x": 0.0,
        "y": 0.4000000059604645,
        "z": 0.0
      },
      "end_point": {
        "x": 0.0,
        "y": -0.20000000298023224,
        "z": 0.0
      },
      "direction_vector": {
        "dx": 0.0,
        "dy": -1.0,
        "dz": 0.0,
        "magnitude": 0.6000000089406967
      },
      "total_length": 1.8944272274662421,
      "is_closed": false,
      "curvature": {
        "mean_curvature": 1.0085602429959504,
        "max_curvature": 2.0344440252026645,
        "total_angle_change": 12.102722915951404
      },
      "stroke_type": "arc",
      "simplified_points": [
        {
          "x": 0.0,
          "y": 0.4000000059604645,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": 0.3499999940395355,
          "z": 0.0
        },
        {
          "x": 0.15000000596046448,
          "y": 0.25,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": 0.15000000596046448,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": 0.10000000149011612,
          "z": 0.0
        },
        {
          "x": -0.10000000149011612,
          "y": 0.15000000596046448,
          "z": 0.0
        },
        {
          "x": -0.15000000596046448,
          "y": 0.25,
          "z": 0.0
        },
        {
          "x": -0.10000000149011612,
          "y": 0.3499999940395355,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": 0.4000000059604645,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": -0.30000001192092896,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": -0.30000001192092896,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": -0.20000000298023224,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": -0.20000000298023224,
          "z": 0.0
        }
      ]
    }
  ],
  "global_features": {
    "total_strokes": 1,
    "total_points": 13,
    "stroke_types_distribution": {
      "line": 0,
      "arc": 1,
      "circle": 0,
      "loop": 0
    }
  }
}
Version 3: Clustering
{
  "strokes": [
    {
      "points_count": 13,
      "bounding_box": {
        "width": 0.30000001192092896,
        "height": 0.7000000178813934,
        "depth": 0.0
      },
      "start_point": {
        "x": 0.0,
        "y": 0.4000000059604645,
        "z": 0.0
      },
      "end_point": {
        "x": 0.0,
        "y": -0.20000000298023224,
        "z": 0.0
      },
      "direction_vector": {
        "dx": 0.0,
        "dy": -1.0,
        "dz": 0.0,
        "magnitude": 0.6000000089406967
      },
      "total_length": 1.8944272274662421,
      "is_closed": false,
      "curvature": {
        "mean_curvature": 1.0085602429959504,
        "max_curvature": 2.0344440252026645,
        "total_angle_change": 12.102722915951404
      },
      "stroke_type": "arc",
      "simplified_points": [
        {
          "x": 0.0,
          "y": 0.4000000059604645,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": 0.3499999940395355,
          "z": 0.0
        },
        {
          "x": 0.15000000596046448,
          "y": 0.25,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": 0.15000000596046448,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": 0.10000000149011612,
          "z": 0.0
        },
        {
          "x": -0.10000000149011612,
          "y": 0.15000000596046448,
          "z": 0.0
        },
        {
          "x": -0.15000000596046448,
          "y": 0.25,
          "z": 0.0
        },
        {
          "x": -0.10000000149011612,
          "y": 0.3499999940395355,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": 0.4000000059604645,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": -0.30000001192092896,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": -0.30000001192092896,
          "z": 0.0
        },
        {
          "x": 0.10000000149011612,
          "y": -0.20000000298023224,
          "z": 0.0
        },
        {
          "x": 0.0,
          "y": -0.20000000298023224,
          "z": 0.0
        }
      ]
    }
  ],
  "global_features": {
    "total_strokes": 1,
    "total_points": 13,
    "stroke_types_distribution": {
      "line": 0,
      "arc": 1,
      "circle": 0,
      "loop": 0
    },
    "bounding_volume": {
      "width": 0.30000001192092896,
      "height": 0.7000000178813934,
      "depth": 0.0,
      "aspect_ratio": 0.42857143465353503
    },
    "centroid": {
      "x": 0.015384615613864018,
      "y": 0.1076923064314402,
      "z": 0.0
    },
    "pattern_features": {
      "direction_clusters": {},
      "total_clusters": 0,
      "noise_points": 1
    },
    "topology_features": {
      "connected_components": 1,
      "max_component_size": 1,
      "isolated_strokes": 1,
      "avg_connections": 0.0,
      "max_connections": 0,
      "connection_distribution": {
        "mean": 0.0,
        "std": 0.0,
        "median": 0.0
      }
    },
    "connections": {
      "0": []
    }
  }
}

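Version comparison
Stroke (strokes) features

| Item | Description | Version 1 | Version 2 | Version 3 |
|---|---|---|---|---|
| points_count | Number of simplified points | ✅ | ✅ | ✅ |
| bounding_box | Bounding size of the stroke (width, height, depth) | ✅ | ✅ | ✅ |
| start_point | Start point coordinates of the stroke | ✅ | ✅ | ✅ |
| end_point | End point coordinates of the stroke | ✅ | ✅ | ✅ |
| direction_vector | Direction vector information | ❌ | ✅ | ✅ |
| total_length | Total length of the stroke | ✅ | ✅ | ✅ |
| is_closed | Whether the stroke is closed | ✅ | ✅ | ✅ |
| curvature | Curvature information | ❌ | ✅ | ✅ |
| stroke_type | Type of stroke | ❌ | ✅ | ✅ |
| simplified_points | Coordinates of the simplified point sequence | ✅ | ✅ | ✅ |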
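The article does not say how `simplified_points` is produced. One common way to thin out a polyline is the Ramer-Douglas-Peucker (RDP) algorithm, sketched below as a stand-in; the function names and the `epsilon` value are assumptions, not the author's method. The input is the 14 raw points of the master data, with coordinates rounded.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def _distance_to_line(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b (3D)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab_len = math.sqrt(sum(c * c for c in ab))
    if ab_len == 0.0:
        # a and b coincide (e.g. a closed loop): fall back to point distance.
        return math.dist(p, a)
    cross = (
        ap[1] * ab[2] - ap[2] * ab[1],
        ap[2] * ab[0] - ap[0] * ab[2],
        ap[0] * ab[1] - ap[1] * ab[0],
    )
    return math.sqrt(sum(c * c for c in cross)) / ab_len


def simplify_rdp(points: List[Point], epsilon: float) -> List[Point]:
    """Ramer-Douglas-Peucker simplification of a polyline."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord between the endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _distance_to_line(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    # If every interior point is within epsilon, keep only the endpoints.
    if max_dist <= epsilon:
        return [points[0], points[-1]]
    # Otherwise split at the farthest point and simplify both halves.
    left = simplify_rdp(points[: index + 1], epsilon)
    right = simplify_rdp(points[index:], epsilon)
    return left[:-1] + right


# Raw master points (rounded): 14 points, one of which is collinear.
master: List[Point] = [
    (0.0, 0.4, 0.0), (0.1, 0.35, 0.0), (0.15, 0.25, 0.0), (0.1, 0.15, 0.0),
    (0.0, 0.1, 0.0), (-0.1, 0.15, 0.0), (-0.15, 0.25, 0.0), (-0.1, 0.35, 0.0),
    (0.0, 0.4, 0.0), (0.0, 0.1, 0.0), (0.0, -0.3, 0.0), (0.1, -0.3, 0.0),
    (0.1, -0.2, 0.0), (0.0, -0.2, 0.0),
]
simplified = simplify_rdp(master, epsilon=1e-6)
print(len(master), "->", len(simplified))
```

With a very small `epsilon`, only the redundant collinear point on the vertical shaft is dropped, which happens to agree with the `points_count` of 13 in the master feature data. For hand-drawn strokes a larger `epsilon` would be needed to get a meaningful reduction, and very long strokes may call for an iterative version to avoid deep recursion.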


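Global (global_features) features

| Item | Description | Version 1 | Version 2 | Version 3 |
|---|---|---|---|---|
| total_strokes | Total number of strokes | ✅ | ✅ | ✅ |
| total_points | Total number of points | ✅ | ✅ | ✅ |
| stroke_types_distribution | Distribution of stroke types | ❌ | ✅ | ✅ |
| aspect_ratio (direct) | Aspect ratio (width ÷ height) | ✅ | ❌ | ❌ |
| bounding_volume | Overall bounding size and ratio | ❌ | ❌ | ✅ |
| centroid | Centroid coordinates of the figure | ✅ | ❌ | ✅ |
| pattern_features | Stroke pattern features | ❌ | ❌ | ✅ |
| topology_features | Connectivity features | ❌ | ❌ | ✅ |
| connections | Connection information between strokes | ❌ | ❌ | ✅ |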



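Test results
File size comparison

| Data | Raw data [KB] | Version 1 [KB] | Version 2 [KB] | Version 3 [KB] |
|---|---|---|---|---|
| Master | 2 | 3 | 3 | 4 |
| Human_1 | 995 | 6 | 9 | 13 |
| Human_2 | 945 | 2 | 3 | 4 |
| Human_3 | 1073 | 3 | 4 | 6 |
| Human_4 | 1137 | 3 | 4 | 6 |
| Human_5 | 1181 | 3 | 4 | 6 |
| Human_6 | 506 | 2 | 3 | 4 |
| Human_7 | 873 | 9 | 13 | 29 |
| Human_8 | 281 | 4 | 5 | 9 |
| Human_9 | 538 | 5 | 7 | 9 |
| Human_10 | 867 | 8 | 11 | 19 |
| Human_11 | 546 | 6 | 8 | 15 |
| Human_12 | 717 | 5 | 7 | 10 |
| Human_13 | 1223 | 14 | 19 | 39 |
| Human_14 | 963 | 7 | 10 | 23 |
| Human_15 | 2077 | 18 | 25 | 62 |
| Human_16 | 1563 | 7 | 10 | 15 |
| Human_17 | 136 | 2 | 2 | 3 |
| Human_18 | 1355 | 10 | 15 | 28 |
| Human_19 | 1644 | 10 | 14 | 38 |
| Human_20 | 982 | 8 | 11 | 20 |
| Human_21 | 435 | 4 | 5 | 7 |
| Human_22 | 766 | 9 | 12 | 30 |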



Summary
At these file sizes I expect no token errors, so I check accuracy in STEP4.

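STEP4: Accuracy test of the feature data
Purpose
- Feed the feature data to the AI and evaluate the balance between token count and accuracy
- Decide the best data format and carry it into the following steps (prompt improvement and model selection)

What was done
Measured each feature version with the AI.

| Test item | Purpose |
|---|---|
| Success rate | The proportion correctly judged as a key |
| Response consistency | Whether the same data yields the same result |
| Token reduction | Which version is most effective |
| Processing time reduction | Time for the AI to return a response |

Method
- Use the English-request & Japanese-response prompt
- Measure each feature version (3 kinds) 5 times each
- Send requests to each AI model
- Record accuracy (shape_id, score, reason) + token count + processing time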
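The measurement loop described above can be sketched as follows. This is only an outline under assumptions: `call_model` is a hypothetical stand-in for whatever client code sends a request to a given service and returns the reply text and token count, and the stub below replaces it with a fixed reply so the sketch runs without any API key.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (model_name, prompt, features_json) -> (reply_text, total_tokens)
ModelCall = Callable[[str, str, str], Tuple[str, int]]


def run_measurements(
    call_model: ModelCall,
    models: List[str],
    datasets: Dict[str, str],
    prompt: str,
    runs: int = 5,
) -> List[Dict[str, object]]:
    """Call every model on every data item `runs` times and record the results."""
    records: List[Dict[str, object]] = []
    for model in models:
        for data_id, features_json in datasets.items():
            for run in range(runs):
                start = time.perf_counter()
                reply, tokens = call_model(model, prompt, features_json)
                elapsed = time.perf_counter() - start
                records.append({
                    "model": model,
                    "data_id": data_id,
                    "run": run,
                    "reply": reply,
                    "tokens": tokens,
                    "seconds": elapsed,
                })
    return records


def fake_call(model: str, prompt: str, features_json: str) -> Tuple[str, int]:
    """Stub that stands in for a real API call."""
    return '{"shape_id": "key", "score": 80, "reason": "stub"}', 100


records = run_measurements(
    fake_call,
    models=["model-a", "model-b"],
    datasets={"master": "{}", "human_1": "{}"},
    prompt="(prompt text)",
    runs=5,
)
print(len(records))  # 2 models x 2 data items x 5 runs = 20
```

Keeping one flat record per request makes it straightforward to aggregate afterwards by model, by data item, or by feature version.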

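Number of tests
23 data items × 5 runs × 3 versions = 345 runs
Applied to each of the 12 AI models, the total is 4,140 measurements.

Aggregated test results
Processing efficiency
- Version 1 is the lightest (2,063 tokens on average)
- Version 2 is about 34% larger (2,772 tokens)
- Version 3 is far larger (4,612 tokens, about 2.2× version 1)

Performance by model
- Claude Opus: the most stable scores (82-85 across all versions)
- GPT-4: best result with version 2 (77)
- Gemini Pro: highest with version 2 (90) but a sharp drop with version 3 (67)
- Mistral: the most consistent results (81-83 across all versions)

Processing time
- Claude Opus: the slowest (5-5.5 seconds)
- Other models: roughly 1.5-3 seconds
- Version 2 was the fastest overall

Recommendations
- Version 1: efficiency-oriented (fewest tokens, sufficient accuracy, moderate processing time)
- Version 2: accuracy-oriented (highest scores, most stable processing time)
- Version 3: excessive token count with no improvement in accuracy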



Summary
In terms of accuracy, version 2 was better, but weighing this against token cost I concluded that version 1 is sufficient.

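STEP5: Prompt improvement
In STEP2 the raw data could not be used at all (that is, the AI could not be used in the first place), so STEP3 and STEP4 were spent on feature extraction.

I have said little about accuracy so far, but from STEP5 on I worked on improving accuracy in order to select a model.

Problems at this stage
In a word: accuracy.

- Every model gets the master data right, but human-drawn data is hard.
- Data I absolutely want judged NG sometimes gets judged OK (and vice versa).
- Scores vary from run to run.
Examples:
- A lock rather than a key
- Something clearly different
- Hard to accept as a key

The "that's a lock, not a key" drawing in particular turned out to be a good test case.
Note: as an aside, this apparently happens fairly often in puzzle-solving games (for example, some people hear "outlet" and picture a plug).

This looks like it will be a crux of the app.

Improving the prompt
As follow-ups to STEP4, I saw two options:
- Revise the feature items
- Revise the prompt
Of these, I went with improving the prompt.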

Introducing negative examples
    You are an AI for shape recognition in a 3D VR application.
    Identify which of the following shapes the user has drawn in space:
    key: key - A key for opening doors and boxes, consisting of a long shaft and distinctive teeth pattern. Not to be confused with locks, padlocks, or combination locks.

    Note: ['padlock', 'door lock', 'combination lock'] are NOT keys and should score below 20%

    Respond strictly in the following JSON format (Reply in Japanese only):
    ```json
    {
        ""shape_id"": ""識別された形状ID"",
        ""score"": 0-100,
        ""reason"": ""必ず日本語で簡潔に1文で判断理由を説明してください。""
    }
    ```
    
    Notes:
    - Choose only one shape_id from the provided list: [key].
    - Score represents confidence level (higher = more confident).
    - Score must be between 0-100 (Do not exceed this range).
    - If the shape resembles a negative example, assign a score below 20%.
    - Keep the reason concise (one sentence).
    - Ensure response is fully in Japanese only.
    - Strictly follow the exact JSON format.
    - Do not create new shape IDs.
The added part is the following, which describes what a key is and lists look-alikes that should receive a low score:
key: key - A key for opening doors and boxes, consisting of a long shaft and distinctive teeth pattern. Not to be confused with locks, padlocks, or combination locks.

Note: ['padlock', 'door lock', 'combination lock'] are NOT keys and should score below 20%

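Testing the revised prompt
With the new prompt ready, I tested it while comparing AI models.
- Use the old and new English-request & Japanese-response prompts, measuring 3 times each
- Features: version 1
- Send requests to each AI model
- Record accuracy (shape_id, score, reason) + token count + processing time

Number of tests
23 data items × 3 runs × 2 prompts (old and new) = 138 runs
Applied to each of the 12 AI models, the total is 1,656 measurements.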

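Aggregated test results
Notable changes between the old and new prompts (number of correct judgments out of 23):
Models that improved:
- mistral-large-latest: 18→20 correct (highest accuracy)
- gpt-3.5-turbo-0125: 17→19 correct
- gemini-1.5-flash: 10→19 correct (largest improvement)

Models that got worse:
- chatgpt-4o-latest: 16→11 correct
- gemini-1.5-pro: 16→11 correct
- claude-3-5-sonnet: 16→15 correct

Top models with the new prompt:
- mistral-large-latest (20/23 correct, 87%)
- gpt-3.5-turbo-0125 (19/23 correct, 83%)
- gemini-1.5-flash (19/23 correct, 83%)

Overall evaluation:
Recommended models:
- mistral-large-latest: highest accuracy, improved further by the new prompt, moderate processing time
- gpt-3.5-turbo-0125: high accuracy, short processing time, cost-efficient

Points of note:
- The new prompt brought large improvements for some models
- For some high-end models (chatgpt-4o-latest, gemini-1.5-pro, claude-3-5-sonnet) it backfired
- The Mistral family performed stably overall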


Summary
I continue testing with the new prompt.

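STEP6: Increasing the number of runs to see trends and decide on a model
From here I increased the number of runs to see the tendencies of the models and of the test cases.
- Data I want recognized as a key, and data I do not
- Accuracy and processing speed per model

Test method
- The new English-request & Japanese-response prompt
- Features: version 1
- To guard against possible response caching, use different test data on each request

Number of tests
23 data items × 20 runs = 460 runs
Applied to each of the 12 AI models, the total is 5,520 measurements.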

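Aggregated test results
Note: a data item counts as correct when the proportion of runs that judged it as intended (OK for items I want OK, NG for items I want NG) exceeds 80%.
I would like to simply pick the model with the most correct items, but
- some models can judge the NG items and some cannot, and
- some models can judge the OK items and some cannot,
so it does not look like a single model can cover everything.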
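The "more than 80% of runs" rule can be written down as a small function. This is a sketch under assumptions, not the article's code: the article does not state the score cutoff used to turn a score into OK or NG, so the `ok_threshold` of 70 below is borrowed from the subjective scoring in STEP0 purely for illustration, and the function name is hypothetical.

```python
from typing import List


def is_item_correct(
    scores: List[float],
    expect_ok: bool,
    ok_threshold: float = 70.0,   # assumed cutoff; not stated in the article
    required_rate: float = 0.8,   # "exceeds 80%" from the article
) -> bool:
    """Return True when more than `required_rate` of the runs match the intended verdict."""
    # A run counts as "OK" when its score reaches the assumed threshold.
    matches = sum((score >= ok_threshold) == expect_ok for score in scores)
    return matches / len(scores) > required_rate


# Hypothetical scores from 20 runs on one data item.
mostly_ok = [80] * 17 + [30] * 3      # 17 of 20 runs say OK (85%)
borderline = [80] * 16 + [30] * 4     # 16 of 20 runs say OK (exactly 80%)
clearly_ng = [10] * 20                # every run says NG

print(is_item_correct(mostly_ok, expect_ok=True))
print(is_item_correct(borderline, expect_ok=True))
print(is_item_correct(clearly_ng, expect_ok=False))
```

Because the rule says "exceeds", exactly 80% does not count as correct in this sketch; with 20 runs that means at least 17 runs must match.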

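Summary
Based on these results, instead of adopting a single model I considered whether combining several models would make the judgment work.

Model that judges NG correctly
- Gemini 1.5 Pro

Models that judge OK correctly
- GPT-3.5 Turbo
- Claude 3 Opus
- Ministral 8B
Contrary to my expectations, the GPT-4o family was underwhelming.

Conversely, relatively legacy models were more accurate.

Perhaps recent models, with their stronger reasoning, end up overthinking input data this simple...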

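STEP7: Using multiple models
STEP6 revealed per-model tendencies.
- Model strong at NG judgments: Gemini 1.5 Pro
- Models strong at OK judgments: GPT-3.5 Turbo, Claude 3 Opus, Ministral 8B
If the models are simply used at the same time, there is no way to decide which model's answer to trust. I therefore split the 23 test data items into groups and looked for differences in their feature data.
- Test data that both kinds of model get right: Both group (9 items)
- Test data that the OK-leaning models get right: group A (10 items)
- Test data that the NG-leaning model gets right: group B (4 items)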
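The grouping above can be expressed as a small function. This is a minimal sketch with hypothetical names and made-up inputs; the labels "both", "A", and "B" follow the `group` values used in the article's extracted-feature JSON.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def assign_group(lenient_correct: bool, strict_correct: bool) -> Optional[str]:
    """Assign a data item to a group from which kind of model judged it correctly.

    lenient_correct: the OK-leaning models got this item right.
    strict_correct:  the NG-leaning model got this item right.
    """
    if lenient_correct and strict_correct:
        return "both"
    if lenient_correct:
        return "A"
    if strict_correct:
        return "B"
    # No group is defined in the article for items that neither kind gets right.
    return None


# Made-up per-item results: (lenient_correct, strict_correct).
results: Dict[str, Tuple[bool, bool]] = {
    "item_1": (True, True),
    "item_2": (True, False),
    "item_3": (False, True),
    "item_4": (True, False),
}
groups = {item: assign_group(*flags) for item, flags in results.items()}
print(Counter(groups.values()))
```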

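Summary of characteristics
Group A models
- Often misjudge NG items as OK
- Lenient judgment tendency
- Correctly judge what should be OK
Group B model (gemini-1.5-pro)
- Often misjudges OK items as NG
- Strict judgment tendency
- Correctly judges what should be NG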

Do the characteristics show up numerically?
from typing import Dict


def extract_features(json_data: Dict) -> Dict:
    """
    Extract features for the group analysis from a version 1 feature JSON.
    """
    # Stroke information. Note that only the first stroke is used here,
    # even when the drawing consists of several strokes.
    stroke = json_data["strokes"][0]
    bounding_box = stroke["bounding_box"]

    # Global features
    global_features = json_data["global_features"]

    # Derived features. These assume total_length > 0 and at least two
    # points; otherwise the divisions below raise ZeroDivisionError.
    bounding_box_area = bounding_box["width"] * bounding_box["height"]
    point_density = global_features["total_points"] / stroke["total_length"]
    point_spacing = stroke["total_length"] / (global_features["total_points"] - 1)

    return {
        "aspect_ratio": global_features["aspect_ratio"],
        "bounding_box_area": bounding_box_area,
        "stroke_length": stroke["total_length"],
        "point_density": point_density,
        "point_spacing": point_spacing,
        "total_points": global_features["total_points"],
        "total_strokes": global_features["total_strokes"],
        "is_closed": stroke["is_closed"]
    }
Extracted features (JSON)
[
  {
    "id": "20250201094703_d58d",
    "group": "B",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.2864418782549316,
      "point_density": 48.87553483900871,
      "point_spacing": 0.02203399063499474,
      "total_points": 14,
      "total_strokes": 7,
      "is_closed": false
    }
  },
  {
    "id": "20250201094729_6eb5",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 2.972842336564683,
      "point_density": 1.3455136691245677,
      "point_spacing": 0.9909474455215611,
      "total_points": 4,
      "total_strokes": 2,
      "is_closed": true
    }
  },
  {
    "id": "20250201125254_d2f8",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 3.2981672481627813,
      "point_density": 1.8191921599313237,
      "point_spacing": 0.6596334496325562,
      "total_points": 6,
      "total_strokes": 3,
      "is_closed": false
    }
  },
  {
    "id": "20250201125329_6406",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 1.8928977128523254,
      "point_density": 3.169743382994985,
      "point_spacing": 0.3785795425704651,
      "total_points": 6,
      "total_strokes": 3,
      "is_closed": false
    }
  },
  {
    "id": "20250201125411_5866",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.658235417878163,
      "point_density": 9.115279787497819,
      "point_spacing": 0.1316470835756326,
      "total_points": 6,
      "total_strokes": 3,
      "is_closed": false
    }
  },
  {
    "id": "20250201125458_b4c6",
    "group": "B",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 1.4749149582504322,
      "point_density": 2.7120207694854925,
      "point_spacing": 0.49163831941681074,
      "total_points": 4,
      "total_strokes": 2,
      "is_closed": true
    }
  },
  {
    "id": "20250201125558_b57a",
    "group": "both",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 1.205792279070906,
      "point_density": 18.245265276496518,
      "point_spacing": 0.057418679955757425,
      "total_points": 22,
      "total_strokes": 11,
      "is_closed": false
    }
  },
  {
    "id": "20250201125610_d25d",
    "group": "both",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.6661817158618174,
      "point_density": 12.008735468896294,
      "point_spacing": 0.0951688165516882,
      "total_points": 8,
      "total_strokes": 4,
      "is_closed": true
    }
  },
  {
    "id": "20250201125625_86ce",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.7743184535111106,
      "point_density": 12.914583082264759,
      "point_spacing": 0.08603538372345673,
      "total_points": 10,
      "total_strokes": 5,
      "is_closed": false
    }
  },
  {
    "id": "20250201125630_c495",
    "group": "both",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.9882440906373504,
      "point_density": 18.21412358599708,
      "point_spacing": 0.058132005331608845,
      "total_points": 18,
      "total_strokes": 9,
      "is_closed": false
    }
  },
  {
    "id": "20250201125645_a1c0",
    "group": "B",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.6627509464379474,
      "point_density": 18.10634909613599,
      "point_spacing": 0.060250086039813404,
      "total_points": 12,
      "total_strokes": 6,
      "is_closed": false
    }
  },
  {
    "id": "20250201125654_b5f3",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 1.0214801902708581,
      "point_density": 9.78971505785969,
      "point_spacing": 0.11349779891898423,
      "total_points": 10,
      "total_strokes": 5,
      "is_closed": false
    }
  },
  {
    "id": "20250201125659_ca60",
    "group": "both",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.39031388192410127,
      "point_density": 81.98529819706125,
      "point_spacing": 0.012590770384648429,
      "total_points": 32,
      "total_strokes": 16,
      "is_closed": false
    }
  },
  {
    "id": "20250201125731_a272",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.6517359407496203,
      "point_density": 24.549819949467505,
      "point_spacing": 0.043449062716641354,
      "total_points": 16,
      "total_strokes": 8,
      "is_closed": false
    }
  },
  {
    "id": "20250201125839_f83b",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.13537772839427523,
      "point_density": 310.243054734076,
      "point_spacing": 0.003301895814494518,
      "total_points": 42,
      "total_strokes": 21,
      "is_closed": true
    }
  },
  {
    "id": "20250201125919_66fb",
    "group": "both",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.12229126413320432,
      "point_density": 130.83518363643856,
      "point_spacing": 0.008152750942213622,
      "total_points": 16,
      "total_strokes": 8,
      "is_closed": true
    }
  },
  {
    "id": "20250201125957_1cee",
    "group": "B",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.7926345801743692,
      "point_density": 2.523230817863165,
      "point_spacing": 0.7926345801743692,
      "total_points": 2,
      "total_strokes": 1,
      "is_closed": true
    }
  },
  {
    "id": "20250201125957_80a4",
    "group": "both",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.7706478596578684,
      "point_density": 31.142628503055697,
      "point_spacing": 0.03350642868077689,
      "total_points": 24,
      "total_strokes": 12,
      "is_closed": false
    }
  },
  {
    "id": "20250201130112_ea10",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.06997385308412352,
      "point_density": 314.40315246826987,
      "point_spacing": 0.00333208824210112,
      "total_points": 22,
      "total_strokes": 11,
      "is_closed": true
    }
  },
  {
    "id": "20250201130239_e39a",
    "group": "both",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.6499139786120754,
      "point_density": 27.695972993902856,
      "point_spacing": 0.03823023403600444,
      "total_points": 18,
      "total_strokes": 9,
      "is_closed": false
    }
  },
  {
    "id": "20250201130300_c2cf",
    "group": "both",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 1.1455820785717148,
      "point_density": 6.983349468921698,
      "point_spacing": 0.16365458265310212,
      "total_points": 8,
      "total_strokes": 4,
      "is_closed": true
    }
  },
  {
    "id": "20250201130350_2fb5",
    "group": "A",
    "features": {
      "aspect_ratio": 0.0,
      "bounding_box_area": 0.0,
      "stroke_length": 0.000874205813534764,
      "point_density": 22877.90779968849,
      "point_spacing": 4.601083229130337e-05,
      "total_points": 20,
      "total_strokes": 10,
      "is_closed": true
    }
  },
  {
    "id": "master_afbd",
    "group": "both",
    "features": {
      "aspect_ratio": 0.42857143465353503,
      "bounding_box_area": 0.2100000137090685,
      "stroke_length": 1.8944272274662421,
      "point_density": 6.862232452912554,
      "point_spacing": 0.15786893562218685,
      "total_points": 13,
      "total_strokes": 1,
      "is_closed": false
    }
  }
]

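Results
From the extracted features, I set the following thresholds.
- point_density > 50: possibly noise or unintended input
- total_strokes <= 2: the shape is too simple, so it needs careful discrimination
- total_length < 0.2: extremely short input may be unintentional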


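Model usage policy
- GPT-3.5 Turbo: cases where I do not want to miss an OK
- Gemini 1.5 Pro: cases where I want NG judged firmly
- Mistral Large: not biased toward OK or NG, and the model with the most correct judgments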

Server-side implementation
Code that assigns a group based on the features (Python)
from typing import Any, Dict

# DrawingGroup is assumed to be an enum defined elsewhere in the server code (not shown in this article).
def classify_drawing(features: Dict[str, Any]) -> DrawingGroup:
    """
    Classify a drawing into a group based on its features.
    - GROUP_A: gpt-3.5-turbo-0125 (strong at judging OK)
    - GROUP_B: gemini-1.5-pro (strong at judging NG)
    - BOTH: mistral-large-latest (stable on both)
    """
    global_features = features.get("global_features", {})
    strokes = features.get("strokes", [])
    
    total_points = global_features.get("total_points", 0)
    total_length = sum(stroke.get("total_length", 0) for stroke in strokes)
    point_density = total_points / total_length if total_length > 0 else 0
    total_strokes = global_features.get("total_strokes", 0)
    
    # Handled by mistral-large-latest (when a stable judgment is needed).
    # This branch is checked first, so it takes precedence over the thresholds below.
    if (6 < point_density < 82) and (4 <= total_strokes <= 16):
        return DrawingGroup.BOTH
    # Handled by gemini-1.5-pro (when judging NG matters most)
    elif point_density > 50 or total_strokes <= 2 or total_length < 0.2:
        return DrawingGroup.GROUP_B
    # Everything else is handled by gpt-3.5-turbo-0125 (when judging OK matters most)
    else:
        return DrawingGroup.GROUP_A
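
The function above relies on a `DrawingGroup` type that is not shown. The sketch below shows one way it might be defined and mapped to model names, assuming an `Enum` whose member names follow the function (`GROUP_A`, `GROUP_B`, `BOTH`). The enum values, the `MODEL_BY_GROUP` mapping and the `make_features` helper are all hypothetical, and the branch logic is repeated only so that the example runs on its own.

```python
from enum import Enum
from typing import Any, Dict

class DrawingGroup(Enum):
    # Hypothetical definition; member names follow the function above.
    GROUP_A = "A"
    GROUP_B = "B"
    BOTH = "both"

# Model assigned to each group, per the selection policy.
MODEL_BY_GROUP = {
    DrawingGroup.GROUP_A: "gpt-3.5-turbo-0125",
    DrawingGroup.GROUP_B: "gemini-1.5-pro",
    DrawingGroup.BOTH: "mistral-large-latest",
}

def classify(features: Dict[str, Any]) -> DrawingGroup:
    """Same branch logic as classify_drawing, repeated so this sketch runs alone."""
    g = features.get("global_features", {})
    strokes = features.get("strokes", [])
    total_points = g.get("total_points", 0)
    total_length = sum(s.get("total_length", 0) for s in strokes)
    density = total_points / total_length if total_length > 0 else 0
    total_strokes = g.get("total_strokes", 0)
    if 6 < density < 82 and 4 <= total_strokes <= 16:
        return DrawingGroup.BOTH
    if density > 50 or total_strokes <= 2 or total_length < 0.2:
        return DrawingGroup.GROUP_B
    return DrawingGroup.GROUP_A

def make_features(total_points: int, total_strokes: int, total_length: float) -> Dict[str, Any]:
    """Build a minimal feature dict with one stroke carrying the whole length."""
    return {
        "global_features": {"total_points": total_points, "total_strokes": total_strokes},
        "strokes": [{"total_length": total_length}],
    }

for feats in (
    make_features(60, 5, 1.0),    # density 60, 5 strokes
    make_features(100, 5, 1.0),   # density 100, 5 strokes
    make_features(30, 5, 10.0),   # density 3, 5 strokes
):
    group = classify(feats)
    print(group.name, "->", MODEL_BY_GROUP[group])
```

The first case shows the effect of branch order: a density of 60 exceeds the `point_density > 50` threshold, but because the drawing has 4 to 16 strokes it is routed to `mistral-large-latest` before that threshold is checked.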

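Summary
This turned into a long article. Thank you for reading this far.

This was my first time using generative AI APIs, including for this kind of research-style testing, and it was quite laborious.
There are also things I would do differently.

"The test data set was far too small."
I think it all comes down to this.

I had wanted to cover shapes other than keys as well, but because of time constraints that did not make it in time for the festival. Even so, I learned a great deal.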

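Bonus
Below is the main Python code I wrote for the tests. I hope it is useful as a reference when calling these APIs.
Creating the prompt
from typing import Any, Dict, List

def create_prompt_en_ja(shape_infos: List[Dict[str, Any]]) -> str:
    """Build the prompt sent to the AI (English request, Japanese response)."""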
    shape_list = [shape["shape_id"] for shape in shape_infos]
    shapes_desc = [
        f"{shape['shape_id']}: {shape['name_en']} - {shape['description_en']}"
        for shape in shape_infos
    ]
    # negative_examples['en'] is a list, so join it into plain comma-separated text
    negative_examples = [
        f"Note: {', '.join(shape['negative_examples']['en'])} are NOT {shape['name_en']}s " +
        f"and should score below {shape['negative_examples']['score_threshold']}%"
        for shape in shape_infos
    ]

    prompt = f"""You are an AI for shape recognition in a 3D VR application.
    Identify which of the following shapes the user has drawn in space:
    {", ".join(shapes_desc)}

    {" ".join(negative_examples)}

    Respond strictly in the following JSON format (Reply in Japanese only):
    ```json
    {{
        "shape_id": "識別された形状ID",
        "score": 0-100,
        "reason": "必ず日本語で簡潔に1文で判断理由を説明してください。"
    }}
    ```
    
    Notes:
    - Choose only one shape_id from the provided list: [{", ".join(shape_list)}].
    - Score represents confidence level (higher = more confident).
    - Score must be between 0-100 (Do not exceed this range).
    - If the shape resembles a negative example, assign a score below {shape_infos[0]['negative_examples']['score_threshold']}%.
    - Keep the reason concise (one sentence).
    - Ensure response is fully in Japanese only.
    - Strictly follow the exact JSON format.
    - Do not create new shape IDs.
    """
    return prompt.strip()
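
One detail worth checking when building the negative-example note: in the config file, `negative_examples["en"]` is a list, and interpolating a list directly into an f-string embeds its Python repr. The sketch below, using a shortened copy of the config values, shows the difference between direct interpolation and joining the items first. The variable names are illustrative only.

```python
# Shortened copy of the "negative_examples" entry from the config file.
negative_examples = {
    "en": ["padlock", "door lock", "combination lock"],
    "score_threshold": 20,
}
name_en = "key"

# Interpolating the list directly embeds its repr, brackets and quotes included.
raw_note = f"Note: {negative_examples['en']} are NOT {name_en}s"

# Joining first yields plain comma-separated text.
joined_note = f"Note: {', '.join(negative_examples['en'])} are NOT {name_en}s"

print(raw_note)
print(joined_note)
```

Which form a model handles better was not measured in these tests, so this is a readability point rather than a measured improvement.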
Querying each AI service
import json
import time
from typing import Dict, Any, Tuple

from openai import OpenAI
import google.generativeai as genai
from anthropic import AsyncAnthropic
from mistralai import Mistral
import tiktoken

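def count_tokens(text: str, model: str) -> int:
    """Count the tokens in a text for the given model.
    
    Args:
        text (str): Text to tokenize
        model (str): Model name
    Returns:
        int: Token count
    """
    try:
        # Use the model-specific encoding for the standard OpenAI models
        if model in ["gpt-4-turbo", "gpt-3.5-turbo-0125"]:
            try:
                encoding = tiktoken.encoding_for_model(model)
            except KeyError:
                # tiktoken does not know this model name; fall back to cl100k_base
                encoding = tiktoken.get_encoding("cl100k_base")
        # All other models use cl100k_base as an approximation
        # (tiktoken is OpenAI's tokenizer, so counts for other vendors are estimates)
        else:
            encoding = tiktoken.get_encoding("cl100k_base")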
        
        return len(encoding.encode(text))
    except Exception as e:
        print(f"Token counting error for {model}: {e}")
        return 0

async def call_openai(api_key: str, prompt: str, model: str, drawing_data: Dict[str, Any]) -> Dict[str, Any]:
    """Query the OpenAI API.
    
    Args:
        api_key (str): OpenAI API key
        prompt (str): Prompt text
        model (str): Model name
        drawing_data (Dict[str, Any]): Drawing data
    Returns:
        Dict: Parsed response with token counts, or an error entry on failure
    """
    client = OpenAI(api_key=api_key)
    start_time = time.time()

    drawing_info = json.dumps(drawing_data, ensure_ascii=False)
    full_prompt = f"{prompt}\n\nDrawing Data: {drawing_info}"
    
    # Count the request tokens
    request_tokens = count_tokens(full_prompt, model)

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": full_prompt}],
            temperature=1.0,
            max_tokens=200
        )
        raw_response = response.choices[0].message.content
        
        # Count the response tokens
        response_tokens = count_tokens(raw_response, model)
        
        result = parse_response(raw_response, time.time() - start_time)
        result.update({
            "request_tokens": request_tokens,
            "response_tokens": response_tokens,
            "total_tokens": request_tokens + response_tokens
        })
        return result
    except Exception as e:
        print(f"OpenAI Error ({model}): {e}")
        return {
            "error": str(e),
            "response_time": time.time() - start_time,
            "request_tokens": request_tokens,
            "response_tokens": 0,
            "total_tokens": request_tokens
        }

async def call_googleai(api_key: str, prompt: str, model: str, drawing_data: Dict[str, Any]) -> Dict[str, Any]:
    """Query Google Gemini.
    
    Args:
        api_key (str): Google API key
        prompt (str): Prompt text
        model (str): Model name
        drawing_data (Dict[str, Any]): Drawing data
    Returns:
        Dict: Parsed response with token counts, or an error entry on failure
    """
    start_time = time.time()
    
    drawing_info = json.dumps(drawing_data, ensure_ascii=False)
    full_prompt = f"{prompt}\n\nDrawing Data: {drawing_info}"
    
    # Count the request tokens
    request_tokens = count_tokens(full_prompt, model)

    try:
        genai.configure(api_key=api_key)
        model_instance = genai.GenerativeModel(model)

        response = model_instance.generate_content(full_prompt)
        raw_response = response.text
        
        # Count the response tokens
        response_tokens = count_tokens(raw_response, model)
        
        result = parse_response(raw_response, time.time() - start_time)
        result.update({
            "request_tokens": request_tokens,
            "response_tokens": response_tokens,
            "total_tokens": request_tokens + response_tokens
        })
        return result
    except Exception as e:
        print(f"Google AI Error ({model}): {e}")
        return {
            "error": str(e),
            "response_time": time.time() - start_time,
            "request_tokens": request_tokens,
            "response_tokens": 0,
            "total_tokens": request_tokens
        }

async def call_anthropic(api_key: str, prompt: str, model: str, drawing_data: Dict[str, Any]) -> Dict[str, Any]:
    """Query Anthropic Claude.
    
    Args:
        api_key (str): Anthropic API key
        prompt (str): Prompt text
        model (str): Model name
        drawing_data (Dict[str, Any]): Drawing data
    Returns:
        Dict: Parsed response with token counts, or an error entry on failure
    """
    client = AsyncAnthropic(api_key=api_key)
    start_time = time.time()

    drawing_info = json.dumps(drawing_data, ensure_ascii=False)
    full_prompt = f"{prompt}\n\nDrawing Data: {drawing_info}"
    
    # Count the request tokens
    request_tokens = count_tokens(full_prompt, model)

    try:
        response = await client.messages.create(
            model=model,
            max_tokens=200,
            system="You are an AI specialized in recognizing 3D VR shapes, particularly keys.",
            messages=[{"role": "user", "content": full_prompt}]
        )
        raw_response = response.content[0].text
        
        # Count the response tokens
        response_tokens = count_tokens(raw_response, model)
        
        result = parse_response(raw_response, time.time() - start_time)
        result.update({
            "request_tokens": request_tokens,
            "response_tokens": response_tokens,
            "total_tokens": request_tokens + response_tokens
        })
        return result
    except Exception as e:
        print(f"Anthropic API Error ({model}): {e}")
        return {
            "error": str(e),
            "response_time": time.time() - start_time,
            "request_tokens": request_tokens,
            "response_tokens": 0,
            "total_tokens": request_tokens
        }

async def call_mistralai(api_key: str, prompt: str, model: str, drawing_data: Dict[str, Any]) -> Dict[str, Any]:
    """Query Mistral AI.
    
    Args:
        api_key (str): Mistral AI API key
        prompt (str): Prompt text
        model (str): Model name
        drawing_data (Dict[str, Any]): Drawing data
    Returns:
        Dict: Parsed response with token counts, or an error entry on failure
    """
    client = Mistral(api_key=api_key)
    start_time = time.time()

    drawing_info = json.dumps(drawing_data, ensure_ascii=False)
    full_prompt = f"{prompt}\n\nDrawing Data: {drawing_info}"
    
    # Count the request tokens
    request_tokens = count_tokens(full_prompt, model)

    messages = [
        {"role": "system", "content": "You are an AI specialized in recognizing 3D VR shapes, particularly keys."},
        {"role": "user", "content": full_prompt}
    ]

    try:
        response = await client.chat.complete_async(
            model=model,
            messages=messages,
            max_tokens=200
        )
        raw_response = response.choices[0].message.content
        
        # Count the response tokens
        response_tokens = count_tokens(raw_response, model)
        
        result = parse_response(raw_response, time.time() - start_time)
        result.update({
            "request_tokens": request_tokens,
            "response_tokens": response_tokens,
            "total_tokens": request_tokens + response_tokens
        })
        return result
    except Exception as e:
        print(f"Mistral AI Error ({model}): {e}")
        return {
            "error": str(e),
            "response_time": time.time() - start_time,
            "request_tokens": request_tokens,
            "response_tokens": 0,
            "total_tokens": request_tokens
        }

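def parse_response(response: str, response_time: float) -> Dict[str, Any]:
    """Parse the AI response as JSON and attach the response time.
    
    Args:
        response (str): Raw AI response
        response_time (float): Response time in seconds
    Returns:
        Dict: Parsed JSON with response_time added, or an error entry if parsing fails
    """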
    try:
        json_text = response.strip()
        if "```json" in json_text:
            json_text = json_text.split("```json")[1].split("```")[0].strip()

        parsed_json = json.loads(json_text)
        parsed_json["response_time"] = round(response_time, 2)
        return parsed_json
    except json.JSONDecodeError as e:
        print(f"JSON Decode Error: {e}")
        return {"error": "JSON Parsing Failed", "response_time": round(response_time, 2)}
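
`parse_response` only strips a fence that is tagged `json`. A reply may also arrive with an untagged fence, with prose around the JSON, or with a score outside 0-100 despite the prompt. The following is a minimal sketch of a more tolerant extractor using only the standard library. The names `extract_json_object` and `clamp_score` are hypothetical and were not part of the tested code.

```python
import json
import re
from typing import Any, Dict, Optional

FENCE = "`" * 3  # Three backticks, built this way to keep the example readable.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Pull the first JSON object out of a model reply, fenced or not."""
    # Prefer the contents of a fenced block when one is present.
    fence = FENCE_RE.search(text)
    candidate = fence.group(1) if fence else text
    # Narrow to the outermost {...} span to skip any surrounding prose.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

def clamp_score(value: Any) -> int:
    """Coerce a score to an int within 0-100, treating unusable values as 0."""
    try:
        return max(0, min(100, int(float(value))))
    except (TypeError, ValueError):
        return 0

reply = "Result: " + '{"shape_id": "key", "score": 120, "reason": "test"}'
parsed = extract_json_object(reply)
print(parsed["shape_id"], clamp_score(parsed["score"]))  # key 100
```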
main
import sys
import asyncio
import time
from datetime import datetime

from libs.common import load_json, save_results
from libs.prompt_manager import create_prompt_en_ja
from libs.ai_services import call_openai, call_googleai, call_anthropic, call_mistralai

# Drawing data files to load
DRAWING_DATA_FILES = [...]

CONFIG_FILE = "config.json"
EXT = ".json"

async def run_test(drawing_file, round_number):
    """Run the AI test against a single data file."""
    print(f"\n=== Processing {drawing_file} (Round {round_number}) ===")

    # Load the configuration
    config = load_json(CONFIG_FILE)
    if not config:
        print("Error: failed to load the config file")
        return

    # Load the drawing data
    drawing_data = load_json(drawing_file + EXT)
    if not drawing_data:
        print("Error: drawing data not found")
        return

    # Build the prompt
    prompt = create_prompt_en_ja(config["shapes"])

    results = []
    output_file = f"test_result/STEP6/ai_results-{drawing_file}.csv"

    # Run every model of every AI service
    for service, model_list in config["models"].items():
        for model in model_list:
            print(f"→ Testing {drawing_file} with {service} (Model: {model})...")

            try:
                if service == "openai":
                    result = await call_openai(config["api_keys"]["openai"], prompt, model, drawing_data)
                elif service == "googleai":
                    result = await call_googleai(config["api_keys"]["googleai"], prompt, model, drawing_data)
                elif service == "anthropic":
                    result = await call_anthropic(config["api_keys"]["anthropic"], prompt, model, drawing_data)
                elif service == "mistralai":
                    result = await call_mistralai(config["api_keys"]["mistralai"], prompt, model, drawing_data)
                else:
                    print(f"Warning: unsupported service - {service}")
                    continue

                results.append([
                    datetime.now().isoformat(),
                    round_number,
                    service,
                    model,
                    round(result.get("response_time", 0), 2),
                    result.get("shape_id", "Unknown"),
                    result.get("score", 0),
                    result.get("reason", "Unknown"),
                    result.get("request_tokens", 0),
                    result.get("response_tokens", 0),
                    result.get("total_tokens", 0)
                ])

                # Wait 10 seconds between API calls to avoid too-many-requests errors
                await asyncio.sleep(10)

            except Exception as e:
                print(f"Error processing {model}: {str(e)}")
                continue

    # Append the results to the CSV
    save_results(results, output_file, mode='a')
    
    # Wait 30 seconds after each file to avoid too-many-requests errors
    await asyncio.sleep(30)

if __name__ == "__main__":
    # Set the asyncio event loop policy on Windows
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

    # Read the round number
    round_number = int(input("Enter the round number (1-20): "))
    if round_number < 1 or round_number > 20:
        print("Error: the round number must be between 1 and 20")
        sys.exit(1)

    # Run through every data file once
    for data_file in DRAWING_DATA_FILES:
        asyncio.run(run_test(data_file, round_number))
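
The fixed waits above are a simple way to stay under rate limits. A common alternative is to retry a failed call with exponential backoff, so that the script only slows down when an error actually occurs. The sketch below is one possible shape for this, not something used in the tests, and `call_with_backoff` and its parameters are hypothetical names. Note that the `call_*` functions in this article catch exceptions and return an `error` entry, so this helper would need to wrap the raw SDK call or be adapted to check for that key.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable

async def call_with_backoff(
    make_call: Callable[[], Awaitable[Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> Any:
    """Await make_call(), retrying with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay * 2^attempt, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))

# Demo with a fake call that fails twice before succeeding.
attempts = {"count": 0}

async def fake_api_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

result = asyncio.run(call_with_backoff(fake_api_call, base_delay=0.01))
print(result, attempts["count"])  # ok 3
```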
Config file
{
    "api_keys": {
        "openai": "...",
        "googleai": "...",
        "anthropic": "...",
        "mistralai": "..."
    },
    "models": {
        "openai": ["chatgpt-4o-latest", "gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo-0125"],
        "googleai": ["gemini-2.0-flash-exp", "gemini-1.5-flash", "gemini-1.5-pro"],
        "anthropic": ["claude-3-5-sonnet-20241022", "claude-3-5-haiku-20241022", "claude-3-opus-20240229"],
        "mistralai": ["mistral-large-latest", "ministral-8b-latest"]
    },
    "shapes": [{
        "shape_id": "key",
        "name_ja": "鍵",
        "name_en": "key",
        "prefab_name": "Item_Key",
        "threshold": 60,
        "description_ja": "鍵は、通常、持ち手（ヘッド）、軸（シャフト）、ギザギザ（ビット）の3つの部分から構成されます。持ち手は円形または四角形で、軸は細長く、先端には独特のギザギザした形状があります。南京錠や錠前、ダイヤル錠は含みません。",
        "description_en": "A key typically consists of three parts: a head (handle), a shaft, and a series of teeth (bit). The head is usually round or square, the shaft is elongated, and the tip has distinctive notches. Padlocks, locks, and combination locks are NOT included.",
        "negative_examples": {
            "ja": ["南京錠", "錠前", "ダイヤル錠", "ただの棒状の物体", "リング状の形", "円盤形のもの"],
            "en": ["padlock", "door lock", "combination lock", "plain rod-like objects", "ring shapes", "disk-shaped items"],
            "score_threshold": 20
        }
    }],
    "output_csv": "ai_results.csv"
}