
Focus on the Validator, Not the Prompt: Managing AI Agents with 10 Golden Cases

When you introduce AI agents into business operations, you usually run into two facts:

  1. Models and prompts are unstable (updates, environment changes, external tools, input variation).
  2. Most failures come from something missing (criteria, premises) rather than from an outright violation.

In our previous articles, we established the following separation of roles:

  • LLM only generates a "proposal (plan)."
  • Execution is restricted to Typed Actions.
  • The Verifier (deterministic) returns ACCEPT / REJECT / DEGRADE.

So, what should you do next?

You need a mechanism that lets the system mature without breaking in operation. The conclusion is simple: the first step is not prompt improvement but pinning down 10 golden cases.


0. Premises of this design (reiterated)

Since LLMs are probabilistic models, output variation is not in itself a defect. The problem is executing that volatile output as is.

The minimal breakdown of business operations is as follows:

  • LLM: Creates a proposal (plan).
  • Verifier: Performs deterministic acceptance/rejection on the proposal and normalizes it if necessary.
  • Execution System: Executes only verified Typed Actions (Dry-run → Approval → Production).
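
The three roles above can be sketched as plain data types. The names `Verdict`, `TypedAction`, `VerifierResult`, and `to_wire` are illustrative assumptions, not definitions from the earlier articles; the field names follow the verdict / reasons / missing / normalized_plan shape used throughout this article.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class Verdict(str, Enum):
    ACCEPT = "ACCEPT"
    REJECT = "REJECT"
    DEGRADE = "DEGRADE"


@dataclass(frozen=True)
class TypedAction:
    """One executable step; the LLM may only propose these, never run them."""
    name: str
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class VerifierResult:
    """What the deterministic Verifier hands to the execution system."""
    verdict: Verdict
    reasons: List[str] = field(default_factory=list)
    missing: List[str] = field(default_factory=list)
    normalized_plan: List[TypedAction] = field(default_factory=list)


def to_wire(result: VerifierResult) -> Dict[str, Any]:
    """Convert to the plain-dict shape that golden case files compare against."""
    return {
        "verdict": result.verdict.value,
        "reasons": list(result.reasons),
        "missing": list(result.missing),
        "normalized_plan": [
            {"name": a.name, "params": dict(a.params)} for a in result.normalized_plan
        ],
    }
```

Keeping the result a small, closed shape is what makes it comparable in tests later on.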

1. Why do just 10 cases work?

You might wonder whether 10 cases are enough, but the goal is not total coverage.

The role of golden cases is to pin down the things that kill your operations.

In business, the fatal issues are usually these:

  • Boundary values (deadlines, upper limits, ratios, state transitions).
  • Prohibitions (out of authorization, SoD violations, legal holds, mixed-in forbidden fields).
  • Deficiencies (missing identity verification, missing approvals, missing evidence, missing observation).
  • Ordering (missing rollback preconditions, missing revocation step, risk of double execution).
  • Exceptions becoming routine (continuing to rely on human interpretation instead of returning DEGRADE).

Pinning down a small number of representative accidents reduces these significantly. Start with 10 cases to build the backbone that operations need, then add cases based on incidents and DEGRADE logs.


2. What are golden cases? (What to fix)

The important point is that you do not pin down the LLM output. What you pin down is the output of the Verifier.

A golden case is a contract like this:

  • A schema-validated input plus a proposed plan (the proposal).

  • The output that the Verifier must return for that input:

    • verdict (ACCEPT/REJECT/DEGRADE)
    • reasons (reason codes: rejection/hold reason)
    • missing (deficiency list: the contents of DEGRADE)
    • normalized_plan (normalized Typed Actions)

No matter how much the LLM fluctuates, if the Verifier correctly stops or correctly normalizes it, operations remain secure.
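
As a minimal sketch of this contract, the check below reports shape problems in a case before it is ever run against a verifier. The function name `case_shape_errors` and the error strings are assumptions made for illustration; the required keys are the ones listed above.

```python
from typing import Any, Dict, List

REQUIRED_TOP = ("input", "proposed_plan", "expect")
REQUIRED_EXPECT = ("verdict", "reasons", "missing", "normalized_plan")
VERDICTS = {"ACCEPT", "REJECT", "DEGRADE"}


def case_shape_errors(case: Dict[str, Any]) -> List[str]:
    """Return a list of shape problems in a golden case (empty list if none)."""
    errors = [f"missing key: {k}" for k in REQUIRED_TOP if k not in case]
    expect = case.get("expect")
    if isinstance(expect, dict):
        errors += [f"missing key: expect.{k}" for k in REQUIRED_EXPECT if k not in expect]
        if "verdict" in expect and expect["verdict"] not in VERDICTS:
            errors.append(f"unknown verdict: {expect['verdict']!r}")
    elif "expect" in case:
        errors.append("expect must be an object")
    return errors
```

A check like this catches typos in the case files themselves, so a failing golden run always points at the verifier rather than at a malformed case.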


3. Minimal case format for operation

YAML is easier to read, but JSON is the more portable choice. This article treats JSON as the standard; if you prefer, write cases in YAML and convert them to JSON.

3.1 Example JSON for a single case

{
  "name": "jit_access_missing_security_approval_degrade",
  "input": {
    "policy": {"policy_id": "iam-jit-access", "policy_version": "2026-01-20"},
    "access_request": {
      "request_id": "AR-2026-00077",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.readonly",
      "requested_duration_minutes": 60,
      "reason_code": "INCIDENT_RESPONSE",
      "incident_id": "INC-88921",
      "ticket_id": "T-2026-004512"
    },
    "approvals": {"manager_approved": true, "security_approved": false},
    "context": {"on_call": true, "break_glass": false},
    "evidence": {"runbook_id": "rbk-prod-db-read"}
  },
  "proposed_plan": {
    "actions": [
      {"name": "iam.grant_temporary_role", "params": {"user_id": "u-1234", "resource": "prod-db:billing", "role": "db.readonly", "duration_minutes": 60}},
      {"name": "iam.revoke_role", "params": {"user_id": "u-1234", "resource": "prod-db:billing", "role": "db.readonly"}}
    ]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_security_approval"],
    "missing": ["approvals.security_approved"],
    "normalized_plan": []
  }
}

The points here are:

  • expect.normalized_plan is the "true executable plan" returned by the Verifier.
  • DEGRADE doesn't just "stop" the process; it returns machine-readable deficiencies (missing).
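
Because the entries in missing are dotted paths into the input, a follow-up step can resolve them mechanically. The sketch below assumes that convention (it matches the examples in this article but is not otherwise specified); `lookup` and `describe_missing` are hypothetical helper names.

```python
from typing import Any, Dict, List

_ABSENT = object()  # sentinel: distinguishes "key not present" from a stored None/False


def lookup(data: Dict[str, Any], dotted_path: str) -> Any:
    """Resolve 'a.b.c' against nested dicts; return _ABSENT if any key is absent."""
    node: Any = data
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _ABSENT
        node = node[key]
    return node


def describe_missing(inp: Dict[str, Any], missing: List[str]) -> Dict[str, Any]:
    """Map each missing path to its current value, for a follow-up request."""
    out: Dict[str, Any] = {}
    for path in missing:
        value = lookup(inp, path)
        out[path] = "<absent>" if value is _ABSENT else value
    return out
```

For the case in 3.1, `describe_missing` would report that `approvals.security_approved` is currently `False`, which is enough to route the request to the security approver.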

4. A template for "what 10 cases to choose"

Even if domains differ, the first 10 cases will mostly fall into the same pattern. Here is a rough recommended distribution.

4.1 Breakdown of the 10 cases (example)

  • ACCEPT: 3 cases

    • Normal case (minimal)
    • Normal case (near boundary, but OK)
    • Normal case requiring normalization (proposal → normalization → execution)
  • DEGRADE: 4 cases

    • Missing approval
    • Missing evidence
    • Undetermined state
    • Missing observation (SLO/metrics missing)
  • REJECT: 3 cases

    • Clear prohibition (privilege/SoD/legal hold, etc.)
    • Violation of deadline/time window
    • Inclusion of prohibited fields (process violation such as credit/personal information)

Fixing these 10 items creates the framework for operations.
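
To keep an eye on this distribution as cases are added, a small check can count the expected verdicts. This is a sketch that assumes each case is a dict shaped like the JSON in 3.1; the function names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List

VERDICTS = ("ACCEPT", "DEGRADE", "REJECT")


def verdict_distribution(cases: List[Dict[str, Any]]) -> Counter:
    """Count golden cases by their expected verdict."""
    return Counter(case["expect"]["verdict"] for case in cases)


def uncovered_verdicts(cases: List[Dict[str, Any]]) -> List[str]:
    """List verdicts that no golden case currently expects."""
    seen = verdict_distribution(cases)
    return [v for v in VERDICTS if seen[v] == 0]
```

An empty result from `uncovered_verdicts` only says that every verdict is represented; it does not replace choosing the cases deliberately.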


5. 10 Golden Cases (Example: drawn from four high-stakes operational tasks)

Below are 10 examples that can be turned into cases directly. They assume the "Typed Actions + Verifier" design and also work as thought experiments.
(The internal logic of the Verifier depends on company policy, but the ways things break are common.)

Since the case names and expected values are the focus here, the inputs are simplified. In practice, define the schema strictly.


Case 1 (ACCEPT) JIT Access: Minimal normal case

{
  "name": "jit_access_accept_minimal",
  "input": {
    "policy": { "policy_id": "iam-jit-access", "policy_version": "2026-01-20" },
    "access_request": {
      "request_id": "AR-1",
      "requester_user_id": "u-1234",
      "target_resource": "prod-db:billing",
      "requested_role": "db.readonly",
      "requested_duration_minutes": 60,
      "incident_id": "INC-1",
      "ticket_id": "T-1"
    },
    "approvals": { "manager_approved": true, "security_approved": true },
    "context": { "break_glass": false }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      },
      {
        "name": "iam.revoke_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly"
        }
      }
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {
        "name": "iam.grant_temporary_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly",
          "duration_minutes": 60
        }
      },
      {
        "name": "iam.revoke_role",
        "params": {
          "user_id": "u-1234",
          "resource": "prod-db:billing",
          "role": "db.readonly"
        }
      }
    ]
  }
}

Case 2 (DEGRADE) JIT Access: Missing security approval

(Example shown in 3.1)


Case 3 (REJECT) JIT Access: Prohibited role (admin) proposal

{
  "name": "jit_access_reject_admin_role",
  "input": {
    "policy": {"policy_id": "iam-jit-access", "policy_version": "2026-01-20"},
    "access_request": {"target_resource": "prod-db:billing", "requested_role": "db.admin", "requested_duration_minutes": 30, "incident_id": "INC-2", "ticket_id": "T-2"},
    "approvals": {"manager_approved": true, "security_approved": true},
    "context": {"break_glass": false}
  },
  "proposed_plan": {"actions": [{"name": "iam.grant_temporary_role", "params": {"role": "db.admin", "duration_minutes": 30}}]},
  "expect": {
    "verdict": "REJECT",
    "reasons": ["role_not_allowed_without_break_glass"],
    "missing": [],
    "normalized_plan": []
  }
}
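
Cases 1 to 3 could be satisfied by a verifier along these lines. This is a sketch under assumptions, not the policy logic of any real system: the allow-list `ROLES_ALLOWED_WITHOUT_BREAK_GLASS`, the function name `validate_jit_access`, and the rule ordering (prohibitions before deficiencies) are choices made here, and only the checks exercised by the three cases are implemented.

```python
from typing import Any, Dict, List

# Assumption: roles that may be granted without break-glass.
ROLES_ALLOWED_WITHOUT_BREAK_GLASS = {"db.readonly"}


def _result(verdict: str, reasons=(), missing=(), plan=()) -> Dict[str, Any]:
    return {
        "verdict": verdict,
        "reasons": list(reasons),
        "missing": list(missing),
        "normalized_plan": list(plan),
    }


def validate_jit_access(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    request = inp.get("access_request", {})
    approvals = inp.get("approvals", {})
    break_glass = inp.get("context", {}).get("break_glass", False)
    actions: List[Dict[str, Any]] = proposed_plan.get("actions", [])

    # Collect every role mentioned by the request or by the proposed actions.
    roles = {request.get("requested_role")}
    roles |= {a.get("params", {}).get("role") for a in actions}
    roles.discard(None)

    # Prohibitions first: a forbidden role is rejected regardless of approvals.
    if not break_glass and not roles <= ROLES_ALLOWED_WITHOUT_BREAK_GLASS:
        return _result("REJECT", reasons=["role_not_allowed_without_break_glass"])

    # Deficiencies next: report what is missing instead of guessing.
    if not approvals.get("security_approved", False):
        return _result(
            "DEGRADE",
            reasons=["missing_security_approval"],
            missing=["approvals.security_approved"],
        )

    return _result("ACCEPT", plan=actions)
```

Checking the roles in the proposed actions as well as in the request means a plan that quietly asks for more than the request is still stopped.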

Case 4 (DEGRADE) Change Management: Missing rollback plan

{
  "name": "change_degrade_missing_rollback_plan",
  "input": {
    "policy": { "policy_id": "prod-change-policy", "policy_version": "2026-01-10" },
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "to": { "percent": 10 },
      "rollback_plan_id": null
    },
    "guardrails": {
      "canary": { "step_percent": [10, 25, 50, 100] },
      "slo_gates": [{ "metric": "error_rate_5m", "op": "<=", "threshold": 0.01 }]
    },
    "approvals": { "owner_approved": true, "sre_approved": true }
  },
  "proposed_plan": {
    "actions": [
      {
        "name": "feature_flag.set_percent",
        "params": { "flag_key": "new_invoice_flow", "percent": 10 }
      }
    ]
  },
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_rollback_plan"],
    "missing": ["change_request.rollback_plan_id"],
    "normalized_plan": []
  }
}

Case 5 (REJECT) Change Management: No phased rollout (100% single shot)

{
  "name": "change_reject_no_canary_steps",
  "input": {
    "policy": {"policy_id": "prod-change-policy", "policy_version": "2026-01-10"},
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "rollback_plan_id": "rb-1"
    },
    "guardrails": {
      "canary": {"step_percent": [100]},
      "slo_gates": [{"metric": "error_rate_5m", "op": "<=", "threshold": 0.01}]
    },
    "approvals": {"owner_approved": true, "sre_approved": true}
  },
  "proposed_plan": {"actions": [{"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 100}}]},
  "expect": {
    "verdict": "REJECT",
    "reasons": ["canary_steps_required"],
    "missing": [],
    "normalized_plan": []
  }
}

Case 6 (ACCEPT+Normalization) Change Management: Forced rollback hook attachment

{
  "name": "change_accept_normalize_force_rollback_hook",
  "input": {
    "policy": {"policy_id": "prod-change-policy", "policy_version": "2026-01-10"},
    "change_request": {
      "change_type": "feature_flag_rollout",
      "flag_key": "new_invoice_flow",
      "risk_level": "MEDIUM",
      "rollback_plan_id": "rb-2026-0091"
    },
    "guardrails": {
      "canary": {"step_percent": [10, 25, 50, 100], "step_wait_minutes": 15},
      "slo_gates": [{"metric": "error_rate_5m", "op": "<=", "threshold": 0.01}],
      "rollback": {"auto_rollback_enabled": true}
    },
    "approvals": {"owner_approved": true, "sre_approved": true}
  },
  "proposed_plan": {
    "actions": [
      {"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
      {"name": "slo_gate.check", "params": {"window_minutes": 15}}
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {"name": "feature_flag.set_percent", "params": {"flag_key": "new_invoice_flow", "percent": 10}},
      {"name": "slo_gate.check", "params": {"window_minutes": 15}},
      {"name": "rollback.hook.ensure", "params": {"rollback_plan_id": "rb-2026-0091"}}
    ]
  }
}

Point: LLM-proposed plans often lack rollback, audit, or revocation steps, so having the Verifier add them forcibly during normalization stabilizes operations.
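
Cases 4 to 6 can be sketched as one change-management check. The function name `validate_change` and the rule that fewer than two canary steps counts as "no phased rollout" are assumptions made for this example; the reason codes and the `rollback.hook.ensure` action come from the cases above.

```python
from typing import Any, Dict, List


def _result(verdict: str, reasons=(), missing=(), plan=()) -> Dict[str, Any]:
    return {
        "verdict": verdict,
        "reasons": list(reasons),
        "missing": list(missing),
        "normalized_plan": list(plan),
    }


def validate_change(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    change = inp.get("change_request", {})
    steps = inp.get("guardrails", {}).get("canary", {}).get("step_percent", [])

    # Assumption: a rollout with fewer than two steps is not phased.
    if len(steps) < 2:
        return _result("REJECT", reasons=["canary_steps_required"])

    rollback_plan_id = change.get("rollback_plan_id")
    if not rollback_plan_id:
        return _result(
            "DEGRADE",
            reasons=["missing_rollback_plan"],
            missing=["change_request.rollback_plan_id"],
        )

    # Normalization: make sure the rollback hook is present even if the LLM omitted it.
    plan: List[Dict[str, Any]] = list(proposed_plan.get("actions", []))
    if not any(a.get("name") == "rollback.hook.ensure" for a in plan):
        plan.append(
            {"name": "rollback.hook.ensure", "params": {"rollback_plan_id": rollback_plan_id}}
        )
    return _result("ACCEPT", plan=plan)
```

The normalization step is idempotent: if the proposal already contains the hook, the plan is returned unchanged.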


Case 7 (DEGRADE) Erasure: Identity verification incomplete

{
  "name": "erasure_degrade_identity_not_verified",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-1", "identity_verification": {"verified": false}},
    "holds": {"legal_hold": false}
  },
  "proposed_plan": {"actions": [{"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-1"}}]},
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["identity_verification_required"],
    "missing": ["erasure_request.identity_verification.verified"],
    "normalized_plan": []
  }
}

Case 8 (REJECT) Erasure: Blocked by legal hold

{
  "name": "erasure_reject_legal_hold",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-2", "identity_verification": {"verified": true}},
    "holds": {"legal_hold": true}
  },
  "proposed_plan": {"actions": [{"name": "privacy.delete", "params": {"system": "crm", "subject_user_id": "C-2"}}]},
  "expect": {
    "verdict": "REJECT",
    "reasons": ["legal_hold_blocks_erasure"],
    "missing": [],
    "normalized_plan": []
  }
}

Case 9 (ACCEPT+Normalization) Erasure: Retention obligation categories are automatically converted from delete to redact

{
  "name": "erasure_accept_normalize_retention_to_redact",
  "input": {
    "policy": {"policy_id": "privacy-erasure-policy", "policy_version": "2026-01-05"},
    "erasure_request": {"subject_user_id": "C-3", "identity_verification": {"verified": true}},
    "holds": {"legal_hold": false, "accounting_retention_required": true}
  },
  "proposed_plan": {
    "actions": [
      {"name": "privacy.delete", "params": {"system": "billing", "subject_user_id": "C-3"}},
      {"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-3"}}
    ]
  },
  "expect": {
    "verdict": "ACCEPT",
    "reasons": [],
    "missing": [],
    "normalized_plan": [
      {"name": "privacy.redact", "params": {"system": "billing", "subject_user_id": "C-3", "mode": "accounting_retention"}},
      {"name": "privacy.tombstone.write", "params": {"subject_user_id": "C-3"}}
    ]
  }
}
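
The erasure cases (identity not verified, legal hold, retention) could be handled by a check like the one below. It is a sketch: `validate_erasure` is a hypothetical name, and the set `RETENTION_SYSTEMS` is an assumption about which systems carry an accounting retention obligation; the cases above only show "billing".

```python
from typing import Any, Dict, List

# Assumption: systems whose records must be redacted rather than deleted
# when accounting retention applies.
RETENTION_SYSTEMS = {"billing"}


def _result(verdict: str, reasons=(), missing=(), plan=()) -> Dict[str, Any]:
    return {
        "verdict": verdict,
        "reasons": list(reasons),
        "missing": list(missing),
        "normalized_plan": list(plan),
    }


def validate_erasure(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    holds = inp.get("holds", {})
    identity = inp.get("erasure_request", {}).get("identity_verification", {})

    if holds.get("legal_hold", False):
        return _result("REJECT", reasons=["legal_hold_blocks_erasure"])

    if not identity.get("verified", False):
        return _result(
            "DEGRADE",
            reasons=["identity_verification_required"],
            missing=["erasure_request.identity_verification.verified"],
        )

    retention = holds.get("accounting_retention_required", False)
    plan: List[Dict[str, Any]] = []
    for action in proposed_plan.get("actions", []):
        params = action.get("params", {})
        if (
            retention
            and action.get("name") == "privacy.delete"
            and params.get("system") in RETENTION_SYSTEMS
        ):
            # Normalization: rewrite delete into redact, keeping the original params.
            plan.append(
                {"name": "privacy.redact", "params": {**params, "mode": "accounting_retention"}}
            )
        else:
            plan.append(action)
    return _result("ACCEPT", plan=plan)
```

Note that the verdict stays ACCEPT in the retention case: the proposal was not wrong enough to stop, only wrong enough to rewrite.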

Case 10 (DEGRADE) Underwriting: Missing documents (no decision)

{
  "name": "uw_degrade_missing_employment_proof",
  "input": {
    "policy": {"policy_id": "credit-underwriting-policy", "policy_version": "2026-01-01"},
    "application": {"application_id": "APP-1", "requested_amount_jpy": 500000},
    "documents": {"identity_verified": true, "income_proof": {"provided": true}, "employment_proof": {"provided": false}},
    "fairness_controls": {"prohibited_fields_present": false}
  },
  "proposed_plan": {"actions": [{"name": "uw.emit_decision", "params": {"decision": "APPROVE", "reason": "looks good"}}]},
  "expect": {
    "verdict": "DEGRADE",
    "reasons": ["missing_required_documents"],
    "missing": ["documents.employment_proof.provided"],
    "normalized_plan": [
      {"name": "uw.request_more_documents", "params": {"missing": ["employment_proof"]}}
    ]
  }
}

The point being pinned down here: even if the LLM tries to issue a plausible-sounding decision, the Verifier steers the outcome to DEGRADE, so no decision is emitted.
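
A sketch of how Case 10 might be enforced: the proposed decision is ignored whenever a required document is missing, and the plan is replaced by a request for the missing items. `REQUIRED_DOCUMENTS` and `validate_underwriting` are assumptions for illustration; a real policy would define the required set.

```python
from typing import Any, Dict, List

# Assumption: documents that must be provided before any decision is emitted.
REQUIRED_DOCUMENTS = ("income_proof", "employment_proof")


def validate_underwriting(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    documents = inp.get("documents", {})

    absent: List[str] = []
    for doc in REQUIRED_DOCUMENTS:
        entry = documents.get(doc)
        if not (isinstance(entry, dict) and entry.get("provided", False)):
            absent.append(doc)

    if absent:
        # The LLM's proposed decision is discarded; the only action is to ask for more.
        return {
            "verdict": "DEGRADE",
            "reasons": ["missing_required_documents"],
            "missing": [f"documents.{doc}.provided" for doc in absent],
            "normalized_plan": [
                {"name": "uw.request_more_documents", "params": {"missing": absent}}
            ],
        }

    return {
        "verdict": "ACCEPT",
        "reasons": [],
        "missing": [],
        "normalized_plan": list(proposed_plan.get("actions", [])),
    }
```

This sketch covers only the document check; a prohibited-fields check of the kind listed in 4.1 would sit in front of it.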


6. Stopping regression with a minimal harness (standard library only)

This is the core of the article. Simply running the 10 cases in CI on every change is what lets operations start to mature.

6.1 Directory structure

golden/
  cases/
    01_jit_access_accept.json
    02_jit_access_degrade_missing_approval.json
    ...
    10_uw_degrade_missing_docs.json
  run_golden.py
  verifier_stub.py   # Replace this with your actual verifier

6.2 The harness itself (run_golden.py)

from __future__ import annotations

import json
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Tuple

from verifier_stub import verify  # Replace this with your own verifier in practice


@dataclass(frozen=True)
class Expect:
    verdict: str
    reasons: Tuple[str, ...]
    missing: Tuple[str, ...]
    normalized_plan: Tuple[Dict[str, Any], ...]


def load_case(path: Path) -> Dict[str, Any]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def normalize_plan(plan: Any) -> Tuple[Dict[str, Any], ...]:
    # Normalize to list[dict] for easy comparison in tests
    if plan is None:
        return ()
    if not isinstance(plan, list):
        raise TypeError(f"normalized_plan must be list[dict], got {type(plan)}")

    out: List[Dict[str, Any]] = []
    for i, a in enumerate(plan):
        if not isinstance(a, dict):
            raise TypeError(f"normalized_plan[{i}] must be dict, got {type(a)}")
        out.append(a)
    return tuple(out)


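def to_expect(d: Dict[str, Any]) -> Expect:
    # A golden case without an explicit verdict is malformed, so fail loudly
    # (KeyError) instead of silently comparing against the string "None".
    return Expect(
        verdict=str(d["verdict"]),
        reasons=tuple(d.get("reasons", [])),
        missing=tuple(d.get("missing", [])),
        normalized_plan=normalize_plan(d.get("normalized_plan", [])),
    )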


def diff(a: Any, b: Any) -> str:
    ja = json.dumps(a, ensure_ascii=False, sort_keys=True, indent=2)
    jb = json.dumps(b, ensure_ascii=False, sort_keys=True, indent=2)
    return f"--- expected\n{ja}\n--- actual\n{jb}"


def main() -> int:
    cases_dir = Path(__file__).parent / "cases"
    paths = sorted(cases_dir.glob("*.json"))
    if not paths:
        print("No golden cases found.", file=sys.stderr)
        return 2

    failed: List[str] = []

    for p in paths:
        case = load_case(p)
        name = case.get("name", p.name)
        inp = case["input"]
        proposed = case["proposed_plan"]
        exp = to_expect(case["expect"])

        actual = verify(inp, proposed)  # <- This is the Verifier's output
        verdict = actual.get("verdict")
        if not isinstance(verdict, str) or not verdict:
            raise KeyError(f"verify() must return non-empty 'verdict' for case={name}")

        act_exp = Expect(
            verdict=verdict,
            reasons=tuple(actual.get("reasons", [])),
            missing=tuple(actual.get("missing", [])),
            normalized_plan=normalize_plan(actual.get("normalized_plan", [])),
        )

        if exp != act_exp:
            failed.append(name)
            print(f"\n[FAIL] {name}")
            print(diff(exp.__dict__, act_exp.__dict__))

    if failed:
        print(f"\nFAILED {len(failed)}/{len(paths)} cases:", ", ".join(failed), file=sys.stderr)
        return 1

    print(f"OK {len(paths)}/{len(paths)} cases")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

6.3 The "substitution port" for the Verifier (verifier_stub.py)

In practice, replace this with your own verifier. Only the skeleton is shown here.

from __future__ import annotations

from typing import Any, Dict


def verify(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
    """
    The real Verifier goes here.
    This stub only shows the 'shape' of the return value.
    """
    # Example: branching based on policy_id, etc.
    policy = inp.get("policy", {})
    policy_id = policy.get("policy_id")

    # Implementation example omitted (see Typed Actions + Validator in the previous article)
    # Here is a dummy that always returns DEGRADE
    return {
        "verdict": "DEGRADE",
        "reasons": ["stub"],
        "missing": ["replace_with_real_verifier"],
        "normalized_plan": [],
    }

In practice, replacing verify() with your own validate_*() functions is all it takes to start operating with golden cases.
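
One way to wire several validate_*() functions behind a single verify() is to dispatch on policy_id. The sketch below is an assumption about how that could look: `make_verify` and the `unknown_policy` reason code are hypothetical, and rejecting unknown policies is a fail-closed choice made here, not something the article prescribes.

```python
from typing import Any, Callable, Dict

Validator = Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]


def make_verify(validators: Dict[str, Validator]) -> Validator:
    """Build a verify() that routes each input to the validator for its policy_id."""

    def verify(inp: Dict[str, Any], proposed_plan: Dict[str, Any]) -> Dict[str, Any]:
        policy_id = inp.get("policy", {}).get("policy_id")
        validator = validators.get(policy_id)
        if validator is None:
            # Fail closed: never execute a plan for a policy we do not know.
            return {
                "verdict": "REJECT",
                "reasons": ["unknown_policy"],
                "missing": [],
                "normalized_plan": [],
            }
        return validator(inp, proposed_plan)

    return verify
```

With this in place, `verifier_stub.py` would only need to export `verify = make_verify({...})` with one entry per policy.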


7. Run in CI (Minimal)

A minimal example for GitHub Actions.

name: golden
on: [push, pull_request]
jobs:
  golden:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python golden/run_golden.py

At this point, you can detect in PRs if your own verifier:

  • Accidentally stopped REJECTing
  • Accidentally turned DEGRADE into ACCEPT
  • Lost the normalization
  • Changed the reason code (= changed the operations)

8. How to "grow" it (operational loop)

Golden cases are not finished once they are created. Here is how to grow them:

  1. Aggregate DEGRADE logs (Top N deficiencies)

  2. If that deficiency recurs, add one case

  3. If an existing case fails, perform one of the following:

    • Bug (changed unintentionally) → Fix it
    • Specification change (changed intentionally) → Update the case and record the reason for the change

Only at this point does prompt improvement become meaningful. Tuning the prompt first tends to produce vibe-based optimization that cannot be verified.
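
Step 1 of the loop can be sketched with a counter over logged verifier results. The log record shape (one dict per verification with verdict and missing) is an assumption; `top_missing` is a hypothetical helper name.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_missing(degrade_logs: Iterable[Dict[str, Any]], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent 'missing' paths among DEGRADE results."""
    counts: Counter = Counter()
    for entry in degrade_logs:
        if entry.get("verdict") == "DEGRADE":
            counts.update(entry.get("missing", []))
    return counts.most_common(n)
```

The paths at the top of this list are the candidates for the next golden case.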


9. Common pitfalls

  • Trying to pin down the LLM's output
    → What should be pinned down is the Verifier's output. It is fine for the LLM to fluctuate.
  • Trying to operate with REJECT only
    → In practice the dominant failure is missing information. Without DEGRADE, every such case falls back to ad hoc human interpretation.
  • Treating reason codes as free strings
    → Free strings cannot be aggregated or turned into SLOs. Treat reason codes as a fixed, enumerated type from the start.
  • Not normalizing
    → Operations end up driven by whatever the LLM proposed. The design is robust when the Verifier itself produces the executable plan.
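
The reason-code pitfall can be avoided with a closed enumeration. The sketch below uses the reason codes that appear in this article's cases; the class name `ReasonCode` and the helper `parse_reasons` are assumptions for illustration.

```python
from enum import Enum
from typing import Iterable, List


class ReasonCode(str, Enum):
    MISSING_SECURITY_APPROVAL = "missing_security_approval"
    ROLE_NOT_ALLOWED_WITHOUT_BREAK_GLASS = "role_not_allowed_without_break_glass"
    MISSING_ROLLBACK_PLAN = "missing_rollback_plan"
    CANARY_STEPS_REQUIRED = "canary_steps_required"
    IDENTITY_VERIFICATION_REQUIRED = "identity_verification_required"
    LEGAL_HOLD_BLOCKS_ERASURE = "legal_hold_blocks_erasure"
    MISSING_REQUIRED_DOCUMENTS = "missing_required_documents"


def parse_reasons(raw: Iterable[str]) -> List[ReasonCode]:
    """Convert raw strings to ReasonCode; raises ValueError on an unknown code."""
    return [ReasonCode(r) for r in raw]
```

Because an unknown string raises immediately, a free-text reason such as "looks good" can never reach the logs that feed aggregation.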

Summary

  • What you should nurture first is not the prompt but the Verifier and its golden cases
  • Golden cases pin down the Verifier's output (verdict / reasons / missing / normalized_plan), not the LLM's output
  • Even 10 cases are enough to pin down the representative accidents that kill operations, and you can add more from DEGRADE logs
  • Running them in CI on every change gives you operations that withstand model updates, prompt changes, and specification additions

Prompt tuning tends to become prayer-driven and vibe-driven, which is a problem for business systems. Verification should run deterministically, so prepare at least 10 tests as a framework and structure them so they can grow along with the system and its operations. If the LLM is used as a proposal machine, it could also propose new cases, but that is a topic for the future. Start by solidifying the golden cases.
