
Trying Function Calling with the AOAI GPT-4o Realtime API

momosuke | Ryosuke Hyakuta

Published 2024/11/08

 Overview
Function Calling (Tools) appears to be available in the Azure OpenAI GPT-4o Realtime API, so I implement it by adding the feature to the JavaScript sample app included in the SDK and check how it behaves.

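 Recap: the GPT-4o Realtime API on Azure OpenAI
On October 1, 2024, the GPT-4o Realtime API model became available for deployment on Azure OpenAI.
For information about the OpenAI model itself, npaka's article is easy to follow, so I will skip the details here. In short, it is a GPT model that enables low-latency voice conversation, and it is now supported in the Azure OpenAI service.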

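 How to develop with it
The GPT-4o Realtime API is in preview at the time of writing, but an SDK has already been published.
https://github.com/Azure-Samples/aoai-realtime-audio-sdk/tree/main
As shown below, the WebSocket handling is abstracted away, so you can communicate with the API server intuitively.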

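 Before (OpenAI API)
// Node.js: the WebSocket client from the "ws" package accepts custom headers.
import WebSocket from "ws";

const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01";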
const ws = new WebSocket(url, {
    headers: {
        "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
        "OpenAI-Beta": "realtime=v1",
    },
});

 After (Azure OpenAI GPT-4o Realtime API SDK)
realtimeStreaming = new LowLevelRTClient(
  new URL(endpoint),
  { key: apiKey },
  { deployment: deploymentOrModel }
);

 Function Calling support
MS Learn does not mention it, but the SDK states that Function Calling is supported.
Works with text messages, function tool calling, and many other existing capabilities from other endpoints like /chat/completions

 Usage
Add a tools property to the JSON sent at the start of the conversation.
{
  "type": "session.update",
  "session": {
    "voice": "alloy",
    "instructions": "Call provided tools if appropriate for the user's input.",
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    "turn_detection": {
      "threshold": 0.4,
      "silence_duration_ms": 600,
      "type": "server_vad"
    },
    // ↓ Add the tools property here
    "tools": [
      {
        "type": "function",
        "name": "get_my_name",
        "description": "Get the name of the user"
      }
    ]
  }
}
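As a rough sketch of the step above, the session.update message could be built by a small helper so that the tool list is kept separate from the audio settings. The names buildSessionUpdate and myTools are hypothetical and not part of the SDK; the field values are copied from the JSON above.

```javascript
// Hypothetical helper (not part of the SDK): builds the session.update
// message shown above, with the tool list passed in separately.
function buildSessionUpdate(tools) {
  return {
    type: "session.update",
    session: {
      voice: "alloy",
      instructions: "Call provided tools if appropriate for the user's input.",
      input_audio_format: "pcm16",
      input_audio_transcription: { model: "whisper-1" },
      turn_detection: {
        threshold: 0.4,
        silence_duration_ms: 600,
        type: "server_vad",
      },
      tools,
    },
  };
}

// Tool definition taken from the article: a function with no arguments.
const myTools = [
  {
    type: "function",
    name: "get_my_name",
    description: "Get the name of the user",
  },
];

const sessionUpdate = buildSessionUpdate(myTools);
console.log(JSON.stringify(sessionUpdate, null, 2));
```

The resulting object would then be passed to realtimeStreaming.send(...) in the same way as the other messages in this article.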
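When the model decides during the conversation that a Function should be used, and has worked out the arguments to pass to it, the server sends a response.output_item.done command message whose item.type is function_call (example below).
// `response.output_item.done` command message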
{
    "type": "response.output_item.done",
    "event_id": "event_AQz0q9o4SIl68fuMPKBXN",
    "response_id": "resp_AQz0pjscf3X2HryMovLZU",
    "output_index": 0,
    "item": {
        "id": "item_AQz0pX5Q7UAMM3z8qzLHY",
        "object": "realtime.item",
        "type": "function_call",
        "status": "completed",
        "name": "get_my_name",
        "call_id": "call_cTE3ifBo5XukndV8",
        "arguments": "{}"
    }
}
Pick up this message and run the Function. Add the result to the conversation as an item, then send a response.create command message. This tells the server that the Function has finished and prompts it to generate a reply.
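case "response.output_item.done": {
  // Braces give this case its own scope for the const declaration.
  const { item } = message;
  // Only function_call items need handling here.
  if (item.type === "function_call") {
    console.log("message", message);
    console.log("function_call", item);
    if (item.name === "get_my_name") {
      // Add the function result to the conversation, tied to this call by call_id.
      realtimeStreaming.send({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: item.call_id,
          output: get_my_name(),
        },
      });
      // Ask the server to generate a reply that uses the result.
      realtimeStreaming.send({
        type: "response.create",
      });
    }
  }
  break;
}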
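Once more than one tool is registered, a chain of checks on item.name grows quickly. One possible arrangement, sketched here, is a lookup table from tool name to handler. The names toolHandlers and handleFunctionCall and the return value "Taro" are hypothetical, and send stands in for realtimeStreaming.send so that the sketch runs on its own. It also assumes that output should be a string, so non-string results are serialized with JSON.stringify.

```javascript
// Hypothetical lookup table: tool name -> handler. Not part of the SDK.
const toolHandlers = new Map([
  ["get_my_name", () => "Taro"], // placeholder value for illustration
]);

// Builds and sends the two messages used in the article for one
// function_call item. `send` stands in for realtimeStreaming.send.
function handleFunctionCall(item, send) {
  const handler = toolHandlers.get(item.name);
  if (!handler) {
    return false; // unknown tool: leave it to the caller
  }
  // `arguments` arrives as a JSON string ("{}" when there are none).
  const args = JSON.parse(item.arguments || "{}");
  const result = handler(args);
  send({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: item.call_id,
      // Assumption: output is expected to be a string.
      output: typeof result === "string" ? result : JSON.stringify(result),
    },
  });
  send({ type: "response.create" });
  return true;
}

// Usage with a fake `send` that just records the messages.
const sent = [];
const handled = handleFunctionCall(
  {
    type: "function_call",
    name: "get_my_name",
    call_id: "call_cTE3ifBo5XukndV8",
    arguments: "{}",
  },
  (msg) => sent.push(msg)
);
console.log(handled, sent);
```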

 Result

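 Points that are unclear at this time
 Two kinds of commands can identify a Function call
Two commands carry the Function and the arguments to pass to it: response.function_call_arguments.done and response.output_item.done. At this point it is unclear which one should be used.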
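If both events end up being handled, the same call could be executed twice. One defensive option, sketched here, is to key on call_id and act only on the first event seen for each call. This assumes that response.function_call_arguments.done carries call_id at the top level, while response.output_item.done carries it under item (as in the example message earlier in this article). The names extractCallId and shouldHandle are hypothetical, and the sketch does not settle which event is the intended one.

```javascript
// call_ids that have already been handled.
const handledCallIds = new Set();

// Returns the call_id for either event type, or null if the message is
// not a function-call event. Assumption: the arguments.done event has
// call_id at the top level.
function extractCallId(message) {
  if (message.type === "response.function_call_arguments.done") {
    return message.call_id ?? null;
  }
  if (
    message.type === "response.output_item.done" &&
    message.item?.type === "function_call"
  ) {
    return message.item.call_id ?? null;
  }
  return null;
}

// True only the first time a given call_id is seen.
function shouldHandle(message) {
  const callId = extractCallId(message);
  if (callId === null || handledCallIds.has(callId)) {
    return false;
  }
  handledCallIds.add(callId);
  return true;
}

// Both events for the same call arrive; only the first one is acted on.
const first = shouldHandle({
  type: "response.function_call_arguments.done",
  call_id: "call_1",
  arguments: "{}",
});
const second = shouldHandle({
  type: "response.output_item.done",
  item: { type: "function_call", call_id: "call_1", name: "get_my_name", arguments: "{}" },
});
console.log(first, second);
```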

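 When to send the Function result
As described above, the "run the function → request reply generation" sequence can be started as soon as either of the two commands is received. However, the SDK README contains the following note.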
Sending the response.create command before the paired response.done command for the prior response arrives (e.g. immediately after an response.function_call_arguments.done or response.output_item.done) may produce unexpected behavior and race conditions.
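In other words, I read this as saying that, strictly speaking, the steps should be "run the function → confirm that the response.done command has been received → request reply generation". For now I send the result as soon as the function returns without worrying about this, and it works, but it may be worth handling properly when running in production.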
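A minimal sketch of that stricter ordering: the function result is added to the conversation right away, but response.create is held back until response.done arrives for the prior response. The names createDeferredResponder and runTool and the fake send are hypothetical and only illustrate the ordering. This follows the reading of the README given above and is not a pattern taken from the SDK.

```javascript
// Hypothetical helper: defers response.create until response.done.
// `send` stands in for realtimeStreaming.send; `runTool` runs the function
// named by the item and returns a string.
function createDeferredResponder(send, runTool) {
  let pendingResponse = false;

  return function handleMessage(message) {
    switch (message.type) {
      case "response.output_item.done": {
        const { item } = message;
        if (item.type === "function_call") {
          send({
            type: "conversation.item.create",
            item: {
              type: "function_call_output",
              call_id: item.call_id,
              output: runTool(item),
            },
          });
          // Remember that a reply still has to be requested.
          pendingResponse = true;
        }
        break;
      }
      case "response.done":
        // The prior response has finished, so request the next one now.
        if (pendingResponse) {
          pendingResponse = false;
          send({ type: "response.create" });
        }
        break;
    }
  };
}

// Usage with a fake send that records the message types in order.
const sentTypes = [];
const onMessage = createDeferredResponder(
  (msg) => sentTypes.push(msg.type),
  () => "Taro" // placeholder tool result
);
onMessage({
  type: "response.output_item.done",
  item: { type: "function_call", call_id: "call_1", name: "get_my_name", arguments: "{}" },
});
const beforeDone = [...sentTypes];
onMessage({ type: "response.done" });
console.log(beforeDone, sentTypes);
```

Only a single flag is kept here. If several function calls can occur in one response, they would all be answered by the one response.create sent after response.done.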

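 Summary
This time I tried Function Calling by calling a function that takes no arguments.
In practice, functions that take arguments will be more common. For how arguments are passed and how to implement a Function, see the SDK documentation.
 Reference: tools definition example from the SDK
This is an example definition of a function that takes an object as its argument. The object's properties can be specified, and enum types are also supported.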
https://github.com/Azure-Samples/aoai-realtime-audio-sdk?tab=readme-ov-file#api-details
"tools": [
  {
    "type": "function",
    "name": "get_weather_for_location",
    "description": "gets the weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The city and state e.g. San Francisco, CA"
        },
        "unit": {
          "type": "string",
          "enum": [
            "c",
            "f"
          ]
        }
      },
      "required": [
        "location",
        "unit"
      ]
    }
  }
]
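For a tool defined like this, the arguments field of the function_call item would arrive as a JSON string that has to be parsed before use. The following sketch parses it and checks the required fields and the enum by hand. The name parseWeatherArgs is hypothetical, and the sample argument strings are made up for illustration.

```javascript
// Hypothetical helper: parses and validates the arguments string for
// get_weather_for_location according to the schema above.
function parseWeatherArgs(argumentsJson) {
  const args = JSON.parse(argumentsJson);
  if (typeof args.location !== "string" || args.location === "") {
    throw new Error("location is required and must be a string");
  }
  if (!["c", "f"].includes(args.unit)) {
    throw new Error('unit is required and must be "c" or "f"');
  }
  return { location: args.location, unit: args.unit };
}

// Made-up example of what item.arguments might contain.
const parsed = parseWeatherArgs('{"location":"San Francisco, CA","unit":"c"}');
console.log(parsed.location, parsed.unit);

// A call that omits the required unit is rejected.
let rejected = false;
try {
  parseWeatherArgs('{"location":"Tokyo"}');
} catch (e) {
  rejected = true;
}
console.log(rejected);
```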

 References
Introducing the Realtime API
Realtime API Docs
GPT-4o Realtime API for speech and audio (Preview)
Azure OpenAI GPT-4o Audio and /realtime: Public Preview Documentation


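Microsoft (Volunteers) Publication

Bringing you the latest technical information on Microsoft Azure and more. Note: this Publication reflects the personal views of employees of Microsoft Japan or Microsoft (US) and is not the official position of their organization. Note: employees who wish to join the Publication should contact @07JP27.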
