Has the cold-start problem of Structured Outputs been solved?


Introduction

When getting an LLM to output JSON, I intuitively assumed that Structured Outputs would be faster than controlling the output by writing the JSON schema into the prompt, so I asked o3 about it.

o3's answer

| Method | Mechanism | Latency per API call | Overall development / operation cost | Notes |
| --- | --- | --- | --- | --- |
| ① Write "return JSON" in the prompt | Plain natural-language instruction | Fast (no extra overhead) | Retries and parse failures are frequent, so total latency tends to balloon | When the output breaks, your business logic has to retry and repair it |
| ② JSON mode ("response_format": {"type": "json_object"}) | Constrains the output to "always JSON" | Roughly the same as ①, or slightly higher | Broken JSON almost disappears, but the schema is not guaranteed, so validation is still needed | Less prompt text means fewer input tokens |
| ③ Structured Outputs (function calling / response_format) | Converts the JSON Schema into a CFG and applies dynamic token masking | First call: +10-60 s (schema parsing) → later calls: ① plus a few hundred ms | Full schema compliance, so retries drop sharply and maintenance cost falls significantly | The first-call delay disappears once cached; be careful when using many different schemas |

| Use case | Recommended setting |
| --- | --- |
| Fast responses are the top priority (in a chat UI even 1-2 s feels slow) | JSON mode. Keep the output short and enable streaming so the first token arrives sooner |
| Strict output is the top priority (ETL, data extraction, reliability-critical APIs) | Structured Outputs (strict: true). Accept the first-call compilation delay and reuse the same schema |
| PoC / prototype, just get it running | The simple approach of writing "reply with the following JSON" in the prompt; move to the stronger options once problems surface |

This was not what I expected: the answer said Structured Outputs has something like a cold-start problem.
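For reference, row ③ in the table covers both function calling and the response_format variant. The benchmark scripts below use function calling; a minimal sketch of the response_format variant, assuming a deployment and API version that support Structured Outputs and using a placeholder schema and deployment name, would look like this:

# Minimal sketch of Structured Outputs via response_format with a strict JSON Schema.
# Assumes the same Azure OpenAI configuration as in the benchmark scripts below;
# the schema and deployment name are placeholders.
import openai

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
    "additionalProperties": False,
}

res = openai.chat.completions.create(
    model="gpt-4o-mini",  # your deployment name
    messages=[{"role": "user", "content": "Return name and age as a JSON object."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "profile", "schema": schema, "strict": True},
    },
)
print(res.choices[0].message.content)  # the content is constrained to match the schema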

I had o3 come up with a script comparing the three approaches

The model is gpt-4o-mini, and I ran the tests on Azure OpenAI Service.
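The scripts below (written by o3) configure the client through module-level openai attributes. For reference, an equivalent setup with an explicit AzureOpenAI client, where the API version string is only an example, would be:

# Alternative setup: explicit AzureOpenAI client instead of module-level attributes.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-08-01-preview",  # example; use the version your resource supports
)
# Calls then go through client.chat.completions.create(model=<deployment name>, ...).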

Simple JSON schema

# Azure OpenAI Service – Latency & Quality Comparison for Three Output Modes
# ---------------------------------------------------------------------------
# 1. Prompt‑only JSON directive
# 2. JSON mode ("response_format": {"type": "json_object"})
# 3. Structured Outputs / Function Calling with strict JSON schema
#
# This script measures **cold‑start** (first request) and **warm** performance
# separately, so you can see the one‑time schema compilation cost of Structured
# Outputs. Save as `azure_openai_mode_comparison.py` or paste into a notebook.
# ---------------------------------------------------------------------------

# %% Imports & basic config
import os
import json
import time
import uuid
import statistics as stats
from typing import Dict, List, Tuple

import openai  # openai >= 1.14.0
from jsonschema import validate, ValidationError

# ---------------------------------------------------------------------------
# REQUIRED ENV VARIABLES (set before running):
#   AZURE_OPENAI_ENDPOINT   - e.g. "https://my-resource.openai.azure.com/"
#   AZURE_OPENAI_KEY        - your key string
#   AZURE_OPENAI_DEPLOYMENT - model deployment name (e.g. "gpt-4o-mini")
# ---------------------------------------------------------------------------
openai.azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_key = os.getenv("AZURE_OPENAI_KEY")
# Azure OpenAI also needs an API version; the default below is only an example.
openai.api_version = os.getenv("OPENAI_API_VERSION", "2024-08-01-preview")
DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")
assert all(
    [openai.azure_endpoint, openai.api_key, DEPLOYMENT]
), "Set AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_KEY / AZURE_OPENAI_DEPLOYMENT env vars first!"

# %% Test prompt & JSON schema definition
USER_PROMPT = (
    "与えられた氏名 name と年齢 age を含む JSON オブジェクトを返してください。\n"
    "キーは必ず name と age の 2 つだけとし、name は文字列、age は整数。"
)
TEST_INPUT = {"name": "Taro", "age": 29}

SCHEMA: Dict = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}

# %% Helper utilities --------------------------------------------------------


def _timed_request(**kwargs) -> Tuple[object, float]:
    """Send ChatCompletion and measure latency (seconds)."""
    t0 = time.perf_counter()
    res = openai.chat.completions.create(**kwargs)
    return res, time.perf_counter() - t0


def _is_valid_json(txt: str) -> bool:
    try:
        json.loads(txt)
        return True
    except json.JSONDecodeError:
        return False


def _validate_schema(obj) -> bool:
    try:
        validate(instance=obj, schema=SCHEMA)
        return True
    except ValidationError:
        return False


# %% Mode wrappers -----------------------------------------------------------


def call_prompt_only(n: int = 1) -> Tuple[List[float], int]:
    """Prompt‑only JSON directive."""
    latencies, failures = [], 0
    system_msg = "You are a bot that returns ONLY a compact JSON object; do not add any explanations."
    for _ in range(n):
        res, dur = _timed_request(
            model=DEPLOYMENT,
            messages=[
                {"role": "system", "content": system_msg},
                {
                    "role": "user",
                    "content": USER_PROMPT + f"\nInput:\n{TEST_INPUT}",
                },
            ],
            temperature=0,
            max_tokens=50,
        )
        txt = res.choices[0].message.content.strip()
        ok = _is_valid_json(txt) and _validate_schema(json.loads(txt))
        failures += 0 if ok else 1
        latencies.append(dur)
    return latencies, failures


def call_json_mode(n: int = 1) -> Tuple[List[float], int]:
    latencies, failures = [], 0
    for _ in range(n):
        res, dur = _timed_request(
            model=DEPLOYMENT,
            messages=[
                {
                    "role": "user",
                    "content": USER_PROMPT + f"\nInput:\n{TEST_INPUT}",
                }
            ],
            response_format={"type": "json_object"},
            temperature=0,
            max_tokens=50,
        )
        txt = res.choices[0].message.content.strip()
        ok = _is_valid_json(txt) and _validate_schema(json.loads(txt))
        failures += 0 if ok else 1
        latencies.append(dur)
    return latencies, failures


def call_structured_outputs(
    n: int = 1, unique_schema: bool = False
) -> Tuple[List[float], int]:
    """Structured Outputs (function calling) with strict schema.

    If ``unique_schema`` is True, each cold run will mutate the schema slightly
    (adds a dummy property) so that OpenAI must re‑process the schema and thus
    reveal the real compilation cost.
    """
    latencies, failures = [], 0

    # Optionally mutate schema so its hash changes and cache cannot hit
    if unique_schema:
        mutated = SCHEMA.copy()
        # Add a dummy description that changes every run
        mutated["description"] = f"dummy-{uuid.uuid4().hex[:8]}"
        active_schema = mutated
    else:
        active_schema = SCHEMA

    fn_name = (
        f"return_profile_{uuid.uuid4().hex[:8]}" if unique_schema else "return_profile"
    )

    tools = [
        {
            "type": "function",
            "function": {
                "name": fn_name,
                "description": "Returns a profile JSON with name and age.",
                "parameters": active_schema,
            },
        }
    ]
    user_content = USER_PROMPT + f"\nInput:\n{TEST_INPUT}"  # same prompt format as the other modes

    for _ in range(n):
        res, dur = _timed_request(
            model=DEPLOYMENT,
            messages=[{"role": "user", "content": user_content}],
            tools=tools,
            tool_choice={"type": "function", "function": {"name": fn_name}},
            temperature=0,
            max_tokens=50,
        )
        try:
            raw = res.choices[0].message.tool_calls[0].function.arguments
            obj = json.loads(raw) if isinstance(raw, str) else raw
            ok = _validate_schema(obj)
        except Exception:
            ok = False
        failures += 0 if ok else 1
        latencies.append(dur)
    return latencies, failures


# %% Benchmark runner --------------------------------------------------------


def benchmark(runs: int = 10):
    """Run cold + warm tests for each mode."""
    results = {}

    # Prompt‑only ----------------------------------------------------------------
    cold_lat, cold_fail = call_prompt_only(1)
    warm_lat, warm_fail = call_prompt_only(runs)
    results["Prompt JSON"] = _summarize(cold_lat, cold_fail, warm_lat, warm_fail)

    # JSON mode ------------------------------------------------------------------
    cold_lat, cold_fail = call_json_mode(1)
    warm_lat, warm_fail = call_json_mode(runs)
    results["JSON mode"] = _summarize(cold_lat, cold_fail, warm_lat, warm_fail)

    # Structured Outputs ---------------------------------------------------------
    # unique_schema=True mutates the schema and function name so the first call cannot hit the cache
    cold_lat, cold_fail = call_structured_outputs(1, unique_schema=True)
    # NOTE: these warm runs also pass unique_schema=True, so their first request is
    # another cache miss; the complex-schema script below passes unique_schema=False here.
    warm_lat, warm_fail = call_structured_outputs(runs, unique_schema=True)
    results["Structured"] = _summarize(cold_lat, cold_fail, warm_lat, warm_fail)

    _print_results(results, runs)


def _summarize(
    cold_lat: List[float], cold_fail: int, warm_lat: List[float], warm_fail: int
):
    return {
        "cold_s": round(cold_lat[0], 3),
        "warm_avg_s": round(stats.mean(warm_lat), 3) if warm_lat else None,
        "warm_p95_s": (
            round(stats.quantiles(warm_lat, n=20)[-1], 3)
            if len(warm_lat) >= 2
            else None
        ),
        "failures": cold_fail + warm_fail,
    }


def _print_results(results: Dict[str, Dict], runs: int):
    print(f"\nBenchmark Summary (warm runs = {runs})\n" + "-" * 55)
    print("Mode            | cold (s) | warm_avg (s) | warm_p95 (s) | failures")
    print("--------------- | -------- | ------------ | ------------ | --------")
    for mode, vals in results.items():
        print(
            f"{mode:15} | {vals['cold_s']:8.3f} | {vals['warm_avg_s']:12.3f} | {vals['warm_p95_s']:12.3f} | {vals['failures']:8d}"
        )


# %% Entrypoint --------------------------------------------------------------
if __name__ == "__main__":
    benchmark(runs=10)

Benchmark Summary (warm runs = 10)

Mode            | cold (s) | warm_avg (s) | warm_p95 (s) | failures
--------------- | -------- | ------------ | ------------ | --------
Prompt JSON     |    1.530 |        0.637 |        0.796 |        6
JSON mode       |    0.714 |        0.645 |        0.782 |        0
Structured      |    0.600 |        0.577 |        1.106 |        0

Looking at the results, the first (cold) call was not particularly slow. If anything, Prompt JSON was the slow one.

Complex JSON schema

When I asked o3 why, it suggested that the schema might be too simple to count as a real first-time compilation, so I tried the following as well.

# Azure OpenAI Service – Latency & Quality Comparison with **Complex** JSON Schema
# ---------------------------------------------------------------------------
# 1. Prompt‑only JSON directive
# 2. JSON mode ("response_format": {"type": "json_object"})
# 3. Structured Outputs / Function Calling with strict JSON schema
#
# The schema now includes:
#   - Nested `address` object (street, city, postal_code)
#   - `contacts` array of objects { type (enum), value }
#   - `preferences` object with boolean & integer array + nullable field
#   - Enum, regex pattern, min/max, required/optional mix
#
# This should trigger a noticeably heavier CFG compilation step when
# `unique_schema=True` is enabled.
# ---------------------------------------------------------------------------

# %% Imports & basic config
import os
import json
import time
import uuid
import statistics as stats
from typing import Dict, List, Tuple

import openai  # openai >= 1.14.0
from jsonschema import validate, ValidationError

# ---------------------------------------------------------------------------
# REQUIRED ENV VARIABLES (set before running):
#   AZURE_OPENAI_ENDPOINT   – e.g. "https://my-resource.openai.azure.com/"
#   AZURE_OPENAI_KEY        – your key string
#   AZURE_OPENAI_DEPLOYMENT – model deployment name (e.g. "gpt-4o-mini")
# ---------------------------------------------------------------------------
openai.azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_key = os.getenv("AZURE_OPENAI_KEY")
# Azure OpenAI also needs an API version; the default below is only an example.
openai.api_version = os.getenv("OPENAI_API_VERSION", "2024-08-01-preview")
DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT")

assert all(
    [openai.azure_endpoint, openai.api_key, DEPLOYMENT]
), "Set AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_KEY / AZURE_OPENAI_DEPLOYMENT env vars first!"

# %% Complex JSON Schema & Prompt -------------------------------------------
USER_PROMPT = (
    "以下の入力 JSON (Input:) と同じ情報を含む JSON オブジェクトを返してください。\n"
    "必ず指定のスキーマに完全準拠し、それ以外のキーは含めないでください。"
)

TEST_INPUT = {
    "name": "Taro Yamamoto",
    "age": 29,
    "address": {"street": "1-2-3 Ginza", "city": "Tokyo", "postal_code": "104-0061"},
    "contacts": [
        {"type": "email", "value": "taro@example.com"},
        {"type": "phone", "value": "+81-90-1234-5678"},
    ],
    "preferences": {
        "newsletterSubscribed": True,
        "favoriteNumbers": [3, 7, 42],
        "preferredLocale": None,
    },
}

SCHEMA: Dict = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "postal_code": {"type": "string", "pattern": "^[0-9]{3}-?[0-9]{4}$"},
            },
            "required": ["street", "city", "postal_code"],
            "additionalProperties": False,
        },
        "contacts": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "properties": {
                    "type": {"type": "string", "enum": ["email", "phone", "fax"]},
                    "value": {"type": "string", "minLength": 3},
                },
                "required": ["type", "value"],
                "additionalProperties": False,
            },
        },
        "preferences": {
            "type": "object",
            "properties": {
                "newsletterSubscribed": {"type": "boolean"},
                "favoriteNumbers": {
                    "type": "array",
                    "items": {"type": "integer"},
                    "minItems": 0,
                    "maxItems": 5,
                },
                "preferredLocale": {"type": ["string", "null"]},
            },
            "required": ["newsletterSubscribed", "favoriteNumbers"],
            "additionalProperties": False,
        },
    },
    "required": ["name", "age", "address", "contacts", "preferences"],
    "additionalProperties": False,
}

# %% Helper utilities --------------------------------------------------------


def _timed_request(**kwargs) -> Tuple[object, float]:
    t0 = time.perf_counter()
    res = openai.chat.completions.create(**kwargs)
    return res, time.perf_counter() - t0


def _is_valid_json(txt: str) -> bool:
    try:
        json.loads(txt)
        return True
    except json.JSONDecodeError:
        return False


def _validate_schema(obj) -> bool:
    try:
        validate(instance=obj, schema=SCHEMA)
        return True
    except ValidationError:
        return False


# %% Mode wrappers -----------------------------------------------------------


def call_prompt_only(n: int = 1) -> Tuple[List[float], int]:
    latencies, failures = [], 0
    system_msg = "You are a bot that returns ONLY the JSON object. No explanations."
    for _ in range(n):
        res, dur = _timed_request(
            model=DEPLOYMENT,
            messages=[
                {"role": "system", "content": system_msg},
                {
                    "role": "user",
                    "content": USER_PROMPT
                    + f"\nInput:\n{json.dumps(TEST_INPUT, ensure_ascii=False)}",
                },
            ],
            temperature=0,
            max_tokens=400,
        )
        txt = res.choices[0].message.content.strip()
        ok = _is_valid_json(txt) and _validate_schema(json.loads(txt))
        failures += 0 if ok else 1
        latencies.append(dur)
    return latencies, failures


def call_json_mode(n: int = 1) -> Tuple[List[float], int]:
    latencies, failures = [], 0
    for _ in range(n):
        res, dur = _timed_request(
            model=DEPLOYMENT,
            messages=[
                {
                    "role": "user",
                    "content": USER_PROMPT
                    + f"\nInput:\n{json.dumps(TEST_INPUT, ensure_ascii=False)}",
                }
            ],
            response_format={"type": "json_object"},
            temperature=0,
            max_tokens=400,
        )
        txt = res.choices[0].message.content.strip()
        ok = _is_valid_json(txt) and _validate_schema(json.loads(txt))
        failures += 0 if ok else 1
        latencies.append(dur)
    return latencies, failures


def call_structured_outputs(
    n: int = 1, unique_schema: bool = False
) -> Tuple[List[float], int]:
    latencies, failures = [], 0

    # Optionally mutate schema so cache definitely misses
    active_schema = SCHEMA.copy()
    if unique_schema:
        active_schema["description"] = f"dummy-{uuid.uuid4().hex[:8]}"

    fn_name = (
        f"return_profile_{uuid.uuid4().hex[:8]}" if unique_schema else "return_profile"
    )

    tools = [
        {
            "type": "function",
            "function": {
                "name": fn_name,
                "description": "Returns a complex profile JSON.",
                "parameters": active_schema,
            },
        }
    ]

    user_content = (
        USER_PROMPT + f"\nInput:\n{json.dumps(TEST_INPUT, ensure_ascii=False)}"
    )

    for _ in range(n):
        res, dur = _timed_request(
            model=DEPLOYMENT,
            messages=[{"role": "user", "content": user_content}],
            tools=tools,
            tool_choice={"type": "function", "function": {"name": fn_name}},
            temperature=0,
            max_tokens=400,
        )
        try:
            raw = res.choices[0].message.tool_calls[0].function.arguments
            obj = json.loads(raw) if isinstance(raw, str) else raw
            ok = _validate_schema(obj)
        except Exception:
            ok = False
        failures += 0 if ok else 1
        latencies.append(dur)
    return latencies, failures


# %% Benchmark runner --------------------------------------------------------


def benchmark(runs: int = 10, cold_unique: bool = True):
    results = {}

    # Prompt JSON ------------------------------------------------------------
    cold_lat, cold_fail = call_prompt_only(1)
    warm_lat, warm_fail = call_prompt_only(runs)
    results["Prompt JSON"] = _summarize(cold_lat, cold_fail, warm_lat, warm_fail)

    # JSON mode --------------------------------------------------------------
    cold_lat, cold_fail = call_json_mode(1)
    warm_lat, warm_fail = call_json_mode(runs)
    results["JSON mode"] = _summarize(cold_lat, cold_fail, warm_lat, warm_fail)

    # Structured -------------------------------------------------------------
    cold_lat, cold_fail = call_structured_outputs(1, unique_schema=cold_unique)
    warm_lat, warm_fail = call_structured_outputs(runs, unique_schema=False)
    results["Structured"] = _summarize(cold_lat, cold_fail, warm_lat, warm_fail)

    _print_results(results, runs)


def _summarize(
    cold_lat: List[float], cold_fail: int, warm_lat: List[float], warm_fail: int
):
    return {
        "cold_s": round(cold_lat[0], 3),
        "warm_avg_s": round(stats.mean(warm_lat), 3) if warm_lat else None,
        "warm_p95_s": (
            round(stats.quantiles(warm_lat, n=20)[-1], 3)
            if len(warm_lat) >= 2
            else None
        ),
        "failures": cold_fail + warm_fail,
    }


def _print_results(results: Dict[str, Dict], runs: int):
    print(f"\nBenchmark Summary (warm runs = {runs})\n" + "-" * 60)
    print("Mode            | cold (s) | warm_avg (s) | warm_p95 (s) | failures")
    print("--------------- | -------- | ------------ | ------------ | --------")
    for mode, vals in results.items():
        print(
            f"{mode:15} | {vals['cold_s']:8.3f} | {vals['warm_avg_s']:12.3f} | {vals['warm_p95_s']:12.3f} | {vals['failures']:8d}"
        )


# %% Entrypoint --------------------------------------------------------------
if __name__ == "__main__":
    benchmark(runs=10, cold_unique=True)
    # Set cold_unique=False if you want to test a cached first run instead.

Benchmark Summary (warm runs = 10)

Mode            | cold (s) | warm_avg (s) | warm_p95 (s) | failures
--------------- | -------- | ------------ | ------------ | --------
Prompt JSON     |    3.752 |        1.861 |        2.364 |        0
JSON mode       |    2.310 |        2.398 |        7.119 |        0
Structured      |    1.065 |        1.462 |        2.459 |        0

It looks like the problem has been resolved.

When I asked o3 again, it came back with the following answer.

  • Cold-start latency has improved to the point of being "hard to notice", but it has not been completely eliminated.

  • Your benchmark happened to hit favorable conditions (caching plus the improved pipeline), so the first call finished in roughly sub-second time.

  • With complex schemas, cache misses, or congested regions, delays of tens of seconds are still being reported.

  • When designing a system, check whether you can tolerate the worst case (10-60 s), and mitigate the risk with warm-up requests or schema optimization (see the warm-up sketch below).
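As a concrete example of the warm-up idea above, the following minimal sketch (the function name and "ping" prompt are placeholders; it reuses the module-level openai configuration, DEPLOYMENT, and SCHEMA from the scripts above) fires one throwaway request per schema at application startup, so the compilation cost is paid before real traffic arrives:

# Warm-up sketch: pre-pay the schema compilation cost at startup.
# Assumes openai is already configured as in the benchmark scripts above.
import openai


def warm_up_schema(deployment: str, schema: dict) -> None:
    """Send one throwaway request so the given schema is compiled and cached."""
    openai.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": "ping"}],
        tools=[{
            "type": "function",
            "function": {"name": "warm_up", "parameters": schema},
        }],
        tool_choice={"type": "function", "function": {"name": "warm_up"}},
        max_tokens=1,  # we only care about the side effect, not the output
    )


# e.g. at application startup:
# warm_up_schema(DEPLOYMENT, SCHEMA)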

Conclusion

It is a somewhat inconclusive answer, but the takeaway is that adopting Structured Outputs is generally the right call, as long as you keep in mind that extra latency can still appear on the first call, when the cache is cleared, or when the model is updated.

