
Investigating the OpenAI Realtime API

Masa Cento

OpenAI's own API has not been released yet (the model list contains no realtime model)

curl https://api.openai.com/v1/models -H "Authorization: Bearer ${OPENAI_API_KEY}" | grep "realtime"
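
The same check can be done in code instead of grep. This is a minimal sketch that assumes the usual `/v1/models` response shape (`{"object": "list", "data": [{"id": ...}]}`); `findRealtimeModels` is a hypothetical helper name, not part of any SDK.

```typescript
// Return the ids of all models whose id contains "realtime".
// `modelsJson` is the raw response body of GET /v1/models.
function findRealtimeModels(modelsJson: string): string[] {
  const parsed = JSON.parse(modelsJson) as { data?: { id: string }[] };
  const models = parsed.data ?? [];
  return models
    .map((m) => m.id)
    .filter((id) => id.indexOf("realtime") !== -1);
}
```

An empty array corresponds to the grep above printing nothing.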

On Azure OpenAI it has been released in the East US 2 region

I was able to deploy the model and obtain an endpoint and key
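
For reference, here is a rough sketch of how the endpoint and key could be turned into a WebSocket connection target. The `/openai/realtime` path, the `api-version` and `deployment` query parameters, and the `api-key` header are my understanding of the Azure preview documentation and should be treated as assumptions; `AzureRealtimeConfig` and the function names are hypothetical.

```typescript
// Values obtained from the Azure portal after deploying the model.
interface AzureRealtimeConfig {
  endpoint: string;   // e.g. "https://<resource>.openai.azure.com"
  deployment: string; // name chosen when deploying the model
  apiKey: string;
  apiVersion: string; // assumption: a preview version string
}

// Build the wss:// URL for the realtime endpoint (path and query names are assumptions).
function buildAzureRealtimeUrl(cfg: AzureRealtimeConfig): string {
  const host = cfg.endpoint.replace(/^https?:\/\//, "").replace(/\/+$/, "");
  const query =
    "api-version=" + encodeURIComponent(cfg.apiVersion) +
    "&deployment=" + encodeURIComponent(cfg.deployment);
  return "wss://" + host + "/openai/realtime?" + query;
}

// Azure authenticates with an "api-key" header rather than "Authorization: Bearer".
function buildAzureAuthHeaders(cfg: AzureRealtimeConfig): { [name: string]: string } {
  return { "api-key": cfg.apiKey };
}
```

Browsers cannot set custom headers on a WebSocket handshake, so a web client would have to pass the key some other way or go through a backend.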


Run the JavaScript sample code

git clone https://github.com/Azure-Samples/aoai-realtime-audio-sdk
cd aoai-realtime-audio-sdk/javascript/samples/
sh download-pkg.sh
cd web
npm install
# Only needed if vite is not installed yet: npm install -g vite
npm run dev

vite was missing, so I installed it; after that the dev server started without problems


The log looks like this

console.log
session.created
input_audio_buffer.speech_started
{
  "type": "input_audio_buffer.speech_stopped",
  "event_id": "event_AE7AFsr4Ls4kdX058ckLB",
  "audio_end_ms": 2080,
  "item_id": "item_AE7AEZvFcBygwZFQ8o4Yp"
}
{
  "type": "input_audio_buffer.committed",
  "event_id": "event_AE7AFwlWaTDfJpOtdyPtX",
  "previous_item_id": null,
  "item_id": "item_AE7AEZvFcBygwZFQ8o4Yp"
}
{
  "type": "conversation.item.created",
  "event_id": "event_AE7AFb26nQzlPrjeNg1oc",
  "previous_item_id": null,
  "item": {
    "id": "item_AE7AEZvFcBygwZFQ8o4Yp",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
{
  "type": "response.created",
  "event_id": "event_AE7AF57xBshOaI3qZ3xSC",
  "response": {
    "object": "realtime.response",
    "id": "resp_AE7AFASlfCpqbTe6KcVtn",
    "status": "in_progress",
    "status_details": null,
    "output": [],
    "usage": null
  }
}
{
  "type": "response.output_item.added",
  "event_id": "event_AE7AFnWqCABzxHM9na8ue",
  "response_id": "resp_AE7AFASlfCpqbTe6KcVtn",
  "output_index": 0,
  "item": {
    "id": "item_AE7AFkYdCGa3lhYlLg3sN",
    "object": "realtime.item",
    "type": "message",
    "status": "in_progress",
    "role": "assistant",
    "content": []
  }
}
{
  "type": "conversation.item.created",
  "event_id": "event_AE7AFCmHfZC2ZSQPSOV1t",
  "previous_item_id": "item_AE7AEZvFcBygwZFQ8o4Yp",
  "item": {
    "id": "item_AE7AFkYdCGa3lhYlLg3sN",
    "object": "realtime.item",
    "type": "message",
    "status": "in_progress",
    "role": "assistant",
    "content": []
  }
}
{
  "type": "response.content_part.added",
  "event_id": "event_AE7AFGtVIAXh06TgOvn49",
  "response_id": "resp_AE7AFASlfCpqbTe6KcVtn",
  "item_id": "item_AE7AFkYdCGa3lhYlLg3sN",
  "output_index": 0,
  "content_index": 0,
  "content": {
    "type": "audio",
    "transcript": ""
  },
  "part": {
    "type": "audio",
    "transcript": ""
  }
}
response.audio_transcript.delta (x2)
conversation.item.input_audio_transcription.completed
response.audio.delta (x2)
response.audio_transcript.delta
response.audio.delta
response.audio_transcript.delta
response.audio.delta (x2)
response.audio_transcript.delta (x5)
response.audio.delta (x2)
response.audio_transcript.delta
response.audio.delta
response.audio_transcript.delta (x4)
response.audio.delta (x4)
{
  "type": "response.audio.done",
  "event_id": "event_AE7AG9Ksd628cdmW6Bju2",
  "response_id": "resp_AE7AFASlfCpqbTe6KcVtn",
  "item_id": "item_AE7AFkYdCGa3lhYlLg3sN",
  "output_index": 0,
  "content_index": 0
}
{
  "type": "response.audio_transcript.done",
  "event_id": "event_AE7AGTnbc2z0iHlgxIupb",
  "response_id": "resp_AE7AFASlfCpqbTe6KcVtn",
  "item_id": "item_AE7AFkYdCGa3lhYlLg3sN",
  "output_index": 0,
  "content_index": 0,
  "transcript": "こんにちは!今日はどんなお手伝いが必要ですか?"
}
{
  "type": "response.content_part.done",
  "event_id": "event_AE7AGLjBa3USXa0f60vVU",
  "response_id": "resp_AE7AFASlfCpqbTe6KcVtn",
  "item_id": "item_AE7AFkYdCGa3lhYlLg3sN",
  "output_index": 0,
  "content_index": 0,
  "content": {
    "type": "audio",
    "transcript": "こんにちは!今日はどんなお手伝いが必要ですか?"
  },
  "part": {
    "type": "audio",
    "transcript": "こんにちは!今日はどんなお手伝いが必要ですか?"
  }
}
{
  "type": "response.output_item.done",
  "event_id": "event_AE7AGrxONmrXi3W7mntUy",
  "response_id": "resp_AE7AFASlfCpqbTe6KcVtn",
  "output_index": 0,
  "item": {
    "id": "item_AE7AFkYdCGa3lhYlLg3sN",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "assistant",
    "content": [
      {
        "type": "audio",
        "transcript": "こんにちは!今日はどんなお手伝いが必要ですか?"
      }
    ]
  }
}

A message response arrives as a sequence of events

  • response.created
  • response.output_item.added
  • conversation.item.created
  • response.content_part.added
  • response.text.delta...
  • response.text.done
  • response.content_part.done
  • response.output_item.done
  • response.done

It looks like a conversation.item is one turn in the conversation, and the response events deliver the content attached to it
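
To make that relationship concrete, here is a minimal sketch that collects `response.text.delta` events per `item_id` and swaps in the full text when `response.text.done` arrives. The event fields used (`item_id`, `delta`, `text`) are the ones visible in the logs; `TextAssembler` itself is a hypothetical helper and not part of the sample SDK.

```typescript
// Only the fields used below; real events carry more.
interface RealtimeEvent {
  type: string;
  item_id?: string;
  delta?: string;
  text?: string;
}

class TextAssembler {
  private partial: { [itemId: string]: string | undefined } = {};
  private finished: { [itemId: string]: string | undefined } = {};

  handle(event: RealtimeEvent): void {
    const id = event.item_id;
    if (!id) return; // events without an item id are not text parts
    switch (event.type) {
      case "response.text.delta":
        // Append the new fragment to whatever has been received so far.
        this.partial[id] = (this.partial[id] ?? "") + (event.delta ?? "");
        break;
      case "response.text.done":
        // The done event repeats the whole text; prefer it over our concatenation.
        this.finished[id] = event.text ?? this.partial[id] ?? "";
        delete this.partial[id];
        break;
    }
  }

  // Text of one conversation item: final if available, otherwise what has streamed in so far.
  textOf(itemId: string): string | undefined {
    return this.finished[itemId] ?? this.partial[itemId];
  }
}
```

Keying by `item_id` keeps each turn separate even if events for several items interleave.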

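  • In VAD (Voice Activity Detection) mode, the end of the user's utterance is detected automatically and the AI's reply is returned
  • If VAD misjudges the boundary, the reply starts in the middle of the utterance, so some control is probably needed, such as prompting the model to confirm what was said and wait
  • Because the connection is a WebSocket, speech recognition keeps running while the AI is answering, so you can interrupt it by speaking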
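
The interruption behaviour in the last bullet could be handled on the client roughly as follows. This is a sketch under assumptions: it reacts to `input_audio_buffer.speech_started` while a response is in progress by sending `response.cancel`. Depending on the server's VAD settings the response may already be cancelled server-side, in which case the client would still need to stop local playback. `InterruptionTracker` is a hypothetical name.

```typescript
interface ServerEvent {
  type: string;
}

interface ClientEvent {
  type: string;
}

// Tracks whether a response is being generated and decides when to cancel it.
class InterruptionTracker {
  private responseInProgress = false;

  // Returns the client events to send in reaction to a server event.
  onServerEvent(event: ServerEvent): ClientEvent[] {
    switch (event.type) {
      case "response.created":
        this.responseInProgress = true;
        return [];
      case "response.done":
        this.responseInProgress = false;
        return [];
      case "input_audio_buffer.speech_started":
        // The user started talking over the AI: ask the server to stop the reply.
        if (this.responseInProgress) {
          this.responseInProgress = false;
          return [{ type: "response.cancel" }];
        }
        return [];
      default:
        return [];
    }
  }
}
```

Returning the events instead of sending them directly keeps the logic independent of the streaming client in use.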

Trying a tool call

Specify the tools in the session; each tool needs type: "function"
While at it, set modalities: ["text"] so that no audio is returned

main.ts
let configMessage: SessionUpdateMessage = {
    type: "session.update",
    session: {
      modalities: ["text"],
      turn_detection: {
        type: "server_vad",
      },
      input_audio_transcription: {
        model: "whisper-1"
      },
      tool_choice: "auto",
      tools: [
        {
          name: "get_weather",
          description: "Get the weather at a given location",
          type: "function",
          parameters: {
            type: "object",
            properties: {
              location: {
                type: "string",
                description: "Location to get the weather from",
              },
              scale: {
                type: "string",
                enum: ["celsius", "fahrenheit"]
              },
            },
            required: ["location", "scale"],
          },
        },
      ]
    }
  };

Add a handler on the client that returns the result of the tool execution

main.ts
async function handleRealtimeMessages() {
  for await (const message of realtimeStreaming.messages()) {
    let consoleLog = "" + message.type;

    switch (message.type) {
...
      case "response.text.delta":
        appendToTextBlock(message.delta);
        break;
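      case "response.function_call_arguments.done":
        // The model has finished streaming the tool arguments (a JSON string).
        console.log("Function call arguments received: " + message.arguments);
        // Return the tool result to the conversation; the output here is a hard-coded stub.
        realtimeStreaming.send({
          type: "conversation.item.create",
          item: {
            type: "function_call_output",
            call_id: message.call_id,
            output: "Rainy, 20degrees celsius",
          },
        });
        // Ask the model to generate a reply that uses the tool output.
        realtimeStreaming.send({
          event_id: "evt_reYb9LWwV1EmL4wz2",
          type: "response.create"
        });
        break;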

The model answered based on the tool result
What you can do is probably limited when the tools are executed in the browser

The log looks like this

console.log
session.created
input_audio_buffer.speech_started
{
  "type": "input_audio_buffer.speech_stopped",
  "event_id": "event_AE7U4cITJOENtUWhu06OJ",
  "audio_end_ms": 2816,
  "item_id": "item_AE7U36239vwconWGzqv8E"
}
{
  "type": "input_audio_buffer.committed",
  "event_id": "event_AE7U4vkuZiAGmlzFtZqUi",
  "previous_item_id": null,
  "item_id": "item_AE7U36239vwconWGzqv8E"
}
{
  "type": "conversation.item.created",
  "event_id": "event_AE7U4Djagn03y4RATHnKM",
  "previous_item_id": null,
  "item": {
    "id": "item_AE7U36239vwconWGzqv8E",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
{
  "type": "response.created",
  "event_id": "event_AE7U4lcX8QVqO8310zgBl",
  "response": {
    "object": "realtime.response",
    "id": "resp_AE7U4kDbllR0YbhcrLhRY",
    "status": "in_progress",
    "status_details": null,
    "output": [],
    "usage": null
  }
}
{
  "type": "response.output_item.added",
  "event_id": "event_AE7U46fz9maxyZAqfXQ7F",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "output_index": 0,
  "item": {
    "id": "item_AE7U4e1aHlvzDRzEGBMRX",
    "object": "realtime.item",
    "type": "function_call",
    "status": "in_progress",
    "name": "get_weather",
    "call_id": "call_oLhGAc0A0wmXA8En",
    "arguments": ""
  }
}
{
  "type": "conversation.item.created",
  "event_id": "event_AE7U45MZM3CVFOdHM8Qlw",
  "previous_item_id": "item_AE7U36239vwconWGzqv8E",
  "item": {
    "id": "item_AE7U4e1aHlvzDRzEGBMRX",
    "object": "realtime.item",
    "type": "function_call",
    "status": "in_progress",
    "name": "get_weather",
    "call_id": "call_oLhGAc0A0wmXA8En",
    "arguments": ""
  }
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4LQmcP9xX9XprHoYA",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "{\n"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4ijAKw3gsNkSgiKow",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": " "
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4j62F6MyIYdWig54d",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": " \""
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4W72ueJ7LnZ9y2074",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "location"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4MVUQToeciuc3LTke",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "\":"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4TRFYG3woPpFRHeYk",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": " \""
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4SpuqAPv4ZDJjjfR8",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "Tokyo"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4TBCB0cERS5U26TW6",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "\",\n"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4Ztg0HUiN72eYd4eJ",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": " "
}
conversation.item.input_audio_transcription.completed
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4Xs91THYsVhYtZM1E",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": " \""
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U40Oe5Hoy85DXqr7gW",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "scale"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4m9RD53JkI1w05z61",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "\":"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4Jc1L93WBVOYXycPF",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": " \""
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4W321YLcia3iku6Yg",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "c"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4mPIvreecLKGzhKih",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "elsius"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4CXbwNo7d7bhMMct6",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "\"\n"
}
{
  "type": "response.function_call_arguments.delta",
  "event_id": "event_AE7U4GdaEqDIZxIqczEZ7",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "output_index": 0,
  "call_id": "call_oLhGAc0A0wmXA8En",
  "delta": "}"
}
Function call arguments received: {
  "location": "Tokyo",
  "scale": "celsius"
}
response.function_call_arguments.done
{
  "type": "response.output_item.done",
  "event_id": "event_AE7U4dQkD00VBU5JAU5zN",
  "response_id": "resp_AE7U4kDbllR0YbhcrLhRY",
  "output_index": 0,
  "item": {
    "id": "item_AE7U4e1aHlvzDRzEGBMRX",
    "object": "realtime.item",
    "type": "function_call",
    "status": "completed",
    "name": "get_weather",
    "call_id": "call_oLhGAc0A0wmXA8En",
    "arguments": "{\n  \"location\": \"Tokyo\",\n  \"scale\": \"celsius\"\n}"
  }
}
response.done
{
  "type": "conversation.item.created",
  "event_id": "event_AE7U5jKNqruEAQWtfFojp",
  "previous_item_id": "item_AE7U4e1aHlvzDRzEGBMRX",
  "item": {
    "id": "item_AE7U5hkbe9W0sKcTwyQbV",
    "object": "realtime.item",
    "type": "function_call_output",
    "status": "completed",
    "call_id": "call_oLhGAc0A0wmXA8En",
    "output": "Rainy, 20degrees celsius"
  }
}
{
  "type": "response.created",
  "event_id": "event_AE7U5yfADQZ6mwr421pc6",
  "response": {
    "object": "realtime.response",
    "id": "resp_AE7U5ZWqCEBxOw4zQRc0T",
    "status": "in_progress",
    "status_details": null,
    "output": [],
    "usage": null
  }
}
{
  "type": "response.output_item.added",
  "event_id": "event_AE7U5PIFyLCv3B1x8Dd3Y",
  "response_id": "resp_AE7U5ZWqCEBxOw4zQRc0T",
  "output_index": 0,
  "item": {
    "id": "item_AE7U5IaYNwtRnNDZHP0ig",
    "object": "realtime.item",
    "type": "message",
    "status": "in_progress",
    "role": "assistant",
    "content": []
  }
}
{
  "type": "conversation.item.created",
  "event_id": "event_AE7U5aL9eFQ4H86ZUvv75",
  "previous_item_id": "item_AE7U5hkbe9W0sKcTwyQbV",
  "item": {
    "id": "item_AE7U5IaYNwtRnNDZHP0ig",
    "object": "realtime.item",
    "type": "message",
    "status": "in_progress",
    "role": "assistant",
    "content": []
  }
}
{
  "type": "response.content_part.added",
  "event_id": "event_AE7U5t6L4NsnUtFaZ5vqJ",
  "response_id": "resp_AE7U5ZWqCEBxOw4zQRc0T",
  "item_id": "item_AE7U5IaYNwtRnNDZHP0ig",
  "output_index": 0,
  "content_index": 0,
  "content": {
    "type": "text",
    "text": ""
  },
  "part": {
    "type": "text",
    "text": ""
  }
}
response.text.delta (x35)
{
  "type": "response.text.done",
  "event_id": "event_AE7U5q537gzp3yttYPej7",
  "response_id": "resp_AE7U5ZWqCEBxOw4zQRc0T",
  "item_id": "item_AE7U5IaYNwtRnNDZHP0ig",
  "output_index": 0,
  "content_index": 0,
  "text": "東京の天気は雨で、気温は20度です。雨が続いているので、傘を忘れずにお出かけくださいね。"
}
{
  "type": "response.content_part.done",
  "event_id": "event_AE7U5Y1uf1r2R05apsoE0",
  "response_id": "resp_AE7U5ZWqCEBxOw4zQRc0T",
  "item_id": "item_AE7U5IaYNwtRnNDZHP0ig",
  "output_index": 0,
  "content_index": 0,
  "content": {
    "type": "text",
    "text": "東京の天気は雨で、気温は20度です。雨が続いているので、傘を忘れずにお出かけくださいね。"
  },
  "part": {
    "type": "text",
    "text": "東京の天気は雨で、気温は20度です。雨が続いているので、傘を忘れずにお出かけくださいね。"
  }
}
{
  "type": "response.output_item.done",
  "event_id": "event_AE7U5LRy8trnKf0myOdQZ",
  "response_id": "resp_AE7U5ZWqCEBxOw4zQRc0T",
  "output_index": 0,
  "item": {
    "id": "item_AE7U5IaYNwtRnNDZHP0ig",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "東京の天気は雨で、気温は20度です。雨が続いているので、傘を忘れずにお出かけくださいね。"
      }
    ]
  }
}
response.done
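
The tool-call flow in the log above (name and `call_id` announced in `response.output_item.added`, arguments streamed as deltas, then `response.function_call_arguments.done`) could be generalised beyond a single hard-coded tool along these lines. This is only a sketch: `ToolCallRouter` and the handler registry are hypothetical, and the event fields are the ones visible in the log.

```typescript
interface RealtimeEvent {
  type: string;
  call_id?: string;
  delta?: string;
  arguments?: string;
  item?: { type: string; name?: string; call_id?: string };
}

type ClientEvent =
  | { type: "conversation.item.create"; item: { type: "function_call_output"; call_id: string; output: string } }
  | { type: "response.create" };

type ToolHandler = (args: { [key: string]: unknown }) => string;

class ToolCallRouter {
  private names: { [callId: string]: string | undefined } = {};
  private buffers: { [callId: string]: string | undefined } = {};

  constructor(private tools: { [name: string]: ToolHandler | undefined }) {}

  // Returns the client events to send in reaction to a server event.
  handle(event: RealtimeEvent): ClientEvent[] {
    switch (event.type) {
      case "response.output_item.added":
        // Remember which tool a call_id refers to.
        if (event.item && event.item.type === "function_call" && event.item.call_id && event.item.name) {
          this.names[event.item.call_id] = event.item.name;
          this.buffers[event.item.call_id] = "";
        }
        return [];
      case "response.function_call_arguments.delta":
        if (event.call_id) {
          this.buffers[event.call_id] = (this.buffers[event.call_id] ?? "") + (event.delta ?? "");
        }
        return [];
      case "response.function_call_arguments.done": {
        if (!event.call_id) return [];
        const callId = event.call_id;
        // Prefer the complete arguments on the done event; fall back to the joined deltas.
        const raw = event.arguments ?? this.buffers[callId] ?? "{}";
        const name = this.names[callId] ?? "";
        const handler = this.tools[name];
        let output: string;
        try {
          output = handler ? handler(JSON.parse(raw)) : "Unknown tool: " + name;
        } catch (err) {
          output = "Tool failed: " + String(err);
        }
        delete this.names[callId];
        delete this.buffers[callId];
        return [
          { type: "conversation.item.create", item: { type: "function_call_output", call_id: callId, output } },
          { type: "response.create" },
        ];
      }
      default:
        return [];
    }
  }
}
```

Errors are reported back as the tool output so the model can explain the failure instead of the conversation stalling.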

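Summary

  • Azure OpenAI's Realtime API is publicly available, but its authentication differs from OpenAI's API, so clients written for OpenAI cannot be used as-is
  • Spoken Japanese is understood, but the displayed transcription sometimes comes out in Korean or Russian
    • Is this kind of fine-tuning the reason the OpenAI API release is delayed?
  • The session itself is stateful, but it is not stored long term, so to resume a past conversation you have to save the history yourself and pass it back in
  • A backend needs to be built out, including hiding the API key and handling tool calls
  • Testing is hard, so prompt engineering is considerably more difficult
    • Conversely, services that store the audio sent to them for testing purposes will surely appear, so read the terms of service before using one
  • If input is billed only for the tokens of the recognized speech, the input cost is probably not that high
  • Cases where the AI keeps talking and the user only reacts occasionally look quite expensive
  • How well it is integrated with tools is likely the key to making use of the Realtime API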
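
For the point about session state not being kept long term, resuming a conversation could look roughly like this: store each turn as text and replay it with `conversation.item.create` events after connecting. The content part types (`input_text` for user turns, `text` for assistant turns) reflect my understanding of the preview API and are an assumption; `StoredTurn` and `buildHistoryEvents` are hypothetical names.

```typescript
// One saved turn of a past conversation (storage format is up to the application).
interface StoredTurn {
  role: "user" | "assistant";
  text: string;
}

interface ItemCreateEvent {
  type: "conversation.item.create";
  item: {
    type: "message";
    role: "user" | "assistant";
    content: { type: "input_text" | "text"; text: string }[];
  };
}

// Convert saved turns into client events, in the original order.
function buildHistoryEvents(history: StoredTurn[]): ItemCreateEvent[] {
  return history.map((turn): ItemCreateEvent => ({
    type: "conversation.item.create",
    item: {
      type: "message",
      role: turn.role,
      // Assumption: user text uses "input_text", assistant text uses "text".
      content: [{ type: turn.role === "user" ? "input_text" : "text", text: turn.text }],
    },
  }));
}
```

The resulting events would be sent one by one right after `session.created`, before any new audio is streamed.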