Introduction to Harness Engineering: 5 Elements to Structure AI Agent Quality

What is Harness Engineering?

Harness engineering is a methodology for designing the environment an AI agent works in. It improves output quality and reproducibility through five elements: rules, skills, hooks, memory, and feedback. It follows prompt engineering (how to ask) and context engineering (what to show), and focuses on "what kind of environment the agent works in." The term was not coined by any single company; it emerged organically within the industry.

Reports on SWE-bench results (e.g., Particula.tech) indicate that, even with the same model, the pass rate varies significantly with the scaffolding design alone. In other words, harness design matters at least as much as model selection.

Why CLAUDE.md Alone Is Not Enough

Even if you write "Please write tests before committing" in CLAUDE.md, it remains just a "request." As project scale grows, the following issues manifest:

  • Nothing enforces rule compliance (agents may ignore the rules)
  • Memory does not carry across sessions (everything must be re-explained each time)
  • Instructions bloat as skills are added, until they become unmanageable

In harness engineering, hooks automatically execute tests before commits and block the commit if they fail. The essence is to transform "requests" into "enforcement."
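
The difference between a request and enforcement can be sketched as a small gate script. This is a minimal sketch, not a specific tool's hook API: it assumes the script is called from a Git pre-commit hook (Git aborts the commit when the hook exits non-zero), and the function name gate_commit is hypothetical.

```python
import subprocess
import sys
from typing import Sequence


def gate_commit(test_command: Sequence[str]) -> int:
    """Run the test command; return 0 to allow the commit, 1 to block it."""
    result = subprocess.run(test_command)
    if result.returncode != 0:
        print("Tests failed: commit blocked.", file=sys.stderr)
        return 1
    return 0
```

A hook script would end with something like sys.exit(gate_commit(["pytest", "-q"])), where the test command is an assumption that depends on the project.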

How Do the Three Evolutionary Stages Build on Each Other?

Each stage builds on the previous one rather than replacing it. One way to organize them:

Stage 1: Prompt Engineering (2022–2024)
  → "How to ask": Few-shot, CoT, Role-play

Stage 2: Context Engineering (2025–)
  → "What to show": RAG, CLAUDE.md, AGENTS.md
  → Karpathy: "Techniques to fill the context window with optimal information"

Stage 3: Harness Engineering (Late 2025–)
  → "What environment to work in": Rules + Skills + Hooks + Memory + Feedback
  → Adds enforcement + persistence + automated verification

What Are the 5 Components of a Harness?

Based on Martin Fowler's framework, we organize these along two axes: feedforward (guidance before action) and feedback (verification after action).

Element   Function                              Control Type  Implementation Example
Rules     Declaration of code of conduct        Feedforward   CLAUDE.md, .claude/rules/
Skills    Reusable procedures                   Feedforward   /write-test, SKILL.md
Hooks     Event-driven triggers                 Feedback      auto-format on file save
Memory    Persistence across sessions           Feedforward   progress.md, Auto Memory
Feedback  Multi-layered automated verification  Feedback      lint → type check → test

Appropriate Size for Rules

The HumanLayer blog recommends keeping CLAUDE.md under 60 lines. It is important to focus on project-specific rules rather than general best practices that the model already knows.
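
A line budget like this is easy to check mechanically. The sketch below is one possible check, not something the HumanLayer blog prescribes: counting only non-blank lines is an assumption, and the names MAX_RULE_LINES, count_rule_lines, and within_budget are hypothetical.

```python
from pathlib import Path

MAX_RULE_LINES = 60  # the budget cited above; adjust per project


def count_rule_lines(text: str) -> int:
    """Count non-blank lines (assumption: blank lines carry no instructions)."""
    return sum(1 for line in text.splitlines() if line.strip())


def within_budget(path: Path, limit: int = MAX_RULE_LINES) -> bool:
    """Return True when the rules file stays under the line budget."""
    return count_rule_lines(path.read_text(encoding="utf-8")) < limit
```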

How to Implement Memory

There are currently three main approaches to maintaining memory across sessions:

  1. Write "Read progress.md" in CLAUDE.md: The simplest, but remains just a "request."
  2. Auto Memory (Claude Code): ~/.claude/projects/<project>/memory/MEMORY.md is automatically loaded each session. What is saved depends on Claude's judgment.
  3. Forced loading via external harness: An Anthropic research pattern where an orchestration layer automatically injects files.

Because context compaction (compression of the conversation history) can drop details, it is effective to write important decisions to an external file such as progress.md as they are made. Files are not compacted, so they can be re-read whenever needed.
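
Writing decisions to a file can be as simple as an append-only log. The following is a minimal sketch under assumed conventions: the bullet-plus-timestamp line format and the function names record_decision and load_decisions are hypothetical, not part of Claude Code.

```python
from datetime import datetime, timezone
from pathlib import Path


def record_decision(progress_file: Path, decision: str) -> None:
    """Append one timestamped decision so it survives context compaction."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with progress_file.open("a", encoding="utf-8") as f:
        f.write(f"- [{stamp}] {decision}\n")


def load_decisions(progress_file: Path) -> list[str]:
    """Re-read all recorded decisions; a missing file means no history yet."""
    if not progress_file.exists():
        return []
    return [
        line[2:].strip()
        for line in progress_file.read_text(encoding="utf-8").splitlines()
        if line.startswith("- ")
    ]
```

Appending rather than rewriting keeps earlier decisions intact even if a later session writes something wrong.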

Two Types of Feedback

Martin Fowler distinguishes between the following:

  • Computational control: Linters, tests. Deterministic and fast. Should be prioritized.
  • Inferential control: LLM-as-Judge. Non-deterministic and slow. Use only at the final stage.
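
The ordering implied by this distinction can be sketched as a fail-fast pipeline. This is an illustrative sketch, not Fowler's code: each check is a plain callable returning True on success, and the name run_feedback and the "llm-judge" label are hypothetical.

```python
from typing import Callable, Sequence

Check = Callable[[], bool]


def run_feedback(
    computational: Sequence[tuple[str, Check]],
    inferential: Check,
) -> tuple[bool, str]:
    """Run cheap deterministic checks in order; consult the judge only if all pass."""
    for name, check in computational:
        if not check():
            return False, name  # fail fast: skip later checks and the judge
    if not inferential():
        return False, "llm-judge"
    return True, "ok"
```

Because the slow, non-deterministic judge runs last, it is never invoked on output that a linter or test would have rejected anyway.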

Practical Lessons from OpenAI's Experiments

The following figures come from an experiment (August 2025 – January 2026) run by an OpenAI team led by Ryan Lopopolo.

Human-written code:     0 lines
Generated code:         approx. 1 million lines
Merged PRs:             approx. 1,500
Per engineer:           3.5 PRs/day
Token consumption:      1 billion+/day (Latent Space estimate: $2,000–$3,000/day)

"Fixing Quality Manually" Does Not Scale

The "AI Slop Friday" practice, in which low-quality code was fixed by hand every Friday, did not scale. The solution was to encode quality standards as automated check rules and have another AI agent generate fix PRs automatically.

Examples of rules OpenAI built into its harness:

  • Don't write the same logic repeatedly: Consolidate common logic into a team-shared library to prevent AI from reinventing it inconsistently.
  • Always check at data entry points: Validate external data at the entrance so that you can trust that "only correct data is flowing" internally.
  • Enforce unidirectional dependencies: Mechanically enforce through linters, for example, "UI can call Service, but not vice-versa."
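
The unidirectional-dependency rule is the easiest of the three to check mechanically. The sketch below is a simplified illustration, not OpenAI's linter: the layer names in LAYERS and the function forbidden_imports are hypothetical, and it only inspects absolute imports whose top-level package is a layer name.

```python
import ast

# Hypothetical layer order: a layer may import only layers listed after it.
LAYERS = ["ui", "service", "repository"]


def forbidden_imports(layer: str, source: str) -> list[str]:
    """Return modules imported by `source` that sit above `layer`."""
    upper = set(LAYERS[: LAYERS.index(layer)])
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        found.extend(n for n in names if n.split(".")[0] in upper)
    return found
```

Run in CI or a hook, a non-empty result fails the check, so a Service module that imports from UI is rejected without any human review.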

Keep Instructions Short and Details in Separate Files

Limit the top-level AI instructions to a table of contents of roughly 100 lines, and move details into docs/ or similar folders. An AI cannot use information it has not read, so the details must be easy to reach. Cramming everything into a single file, however, causes information overload and backfires.

Anthropic's Harness Design Patterns

Pattern A: Two-Agent Configuration (November 2025)

When work is handed over between team members, the successor can pick it up smoothly if the predecessor leaves notes and procedures. This division-of-labor pattern applies the same idea to AI agents.

Initialization Agent (executed only once at project start):

  • Creates development server startup scripts (init.sh)
  • Creates a work log (claude-progress.txt)
  • Creates a feature requirement list (feature_list.json): a JSON file with over 200 items, each describing a required feature along with its verification procedure and a completion flag.

Coding Agent (executed every time in subsequent sessions):

  • Reads claude-progress.txt to grasp progress from previous sessions.
  • Selects the highest priority incomplete feature from feature_list.json.
  • Implements, verifies, commits, and updates the progress file and feature list.

The two agents do not interact directly; they hand off work through files. A key characteristic of this pattern is that it does not create detailed specifications. The "description + verification steps" in the feature list effectively serve as the specification. Since Anthropic's experiment was a project to build a clone of Claude.ai, development progressed using the model's existing knowledge.
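
The file-based handoff can be sketched from the coding agent's side. This is a sketch under assumptions: the field names "priority" and "passes" are hypothetical (the article does not show the schema of feature_list.json), and a smaller priority number is assumed to mean more urgent.

```python
import json
from typing import Optional


def next_feature(feature_list_json: str) -> Optional[dict]:
    """Return the incomplete feature with the smallest priority number, or None."""
    features = json.loads(feature_list_json)
    pending = [f for f in features if not f["passes"]]
    return min(pending, key=lambda f: f["priority"], default=None)
```

A result of None means every item is flagged complete, which is the signal for the session to stop instead of inventing new work.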

However, problems emerged. In some cases the coding agent marked a task as "complete" without checking that the application worked as a whole. The next pattern was created to address two issues: specifications that were too terse and verification that relied on self-reporting.

Pattern B: Three-Agent Configuration — GAN-type (March 2026)

Based on the challenges of Pattern A, this configuration separates planning, development, and evaluation into three independent agents.

  • Planner (Planning): Generates detailed specifications from concise instructions.
  • Generator (Development): Implements based on specifications. Final judgment is deferred to the Evaluator.
  • Evaluator (Evaluation): Tests by manipulating the app in a browser using Playwright MCP.

These three agents are not switched manually by a human; an orchestrator built with the Claude Agent SDK manages them automatically. The SDK exposes the internal engine of the Claude Code CLI as a Python/TS library, so multiple agents can be launched and chained through its query() function. The same configuration can also be built without the SDK by using Claude Code's skill mechanism.

A distinctive feature is the "Sprint Contract," where the Generator and Evaluator agree on "what constitutes completion" before each sprint begins.
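
The control flow around a Sprint Contract can be sketched with the agents stubbed out as plain callables. This is not Anthropic's implementation and does not call the Claude Agent SDK: SprintContract, run_sprint, and the max_rounds limit are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SprintContract:
    """Completion criteria agreed on before the sprint starts."""
    criteria: tuple[str, ...]


def run_sprint(
    contract: SprintContract,
    generate: Callable[[list[str]], str],
    evaluate: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> tuple[bool, list[str]]:
    """Regenerate until every criterion passes or the rounds run out."""
    failing = list(contract.criteria)
    for _ in range(max_rounds):
        artifact = generate(failing)  # Generator sees only what still fails
        failing = [c for c in contract.criteria if not evaluate(artifact, c)]
        if not failing:
            return True, []
    return False, failing
```

The key design point is that completion is decided by the Evaluator against criteria fixed in advance, never by the Generator's own report.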

Comparison of results:
  Simple harness: 20 minutes/$9  → Broken output
  Full harness:   6 hours/$200    → Functional and sophisticated app

Practice: Implementation Steps to Start Tomorrow

Mitchell Hashimoto's rule: "If an agent makes a mistake, design a mechanism so that the mistake never happens again." Conversely, optimizing the harness before any failure has been observed is counterproductive.

Day 1:      Create CLAUDE.md (30 mins, project-specific rules only, keep it concise)
Week 1-2:   Turn repetitive instructions into skills (/write-test, /review, etc.)
Week 2-4:   Use hooks for auto-formatting, type checking, and test execution
Week 2-4:   Write "Read/Update progress.md" in CLAUDE.md. Also utilize Auto Memory
Continuous: Same violation 3 times → Raise rule strictness by 1 level
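
The escalation rule in the last line can be sketched as a small counter. The strictness ladder in LEVELS is an assumption made for illustration (the article does not define the levels), and ViolationTracker is a hypothetical name.

```python
from collections import Counter

# Hypothetical strictness ladder, weakest to strongest.
LEVELS = ["note in CLAUDE.md", "skill", "blocking hook"]
THRESHOLD = 3


class ViolationTracker:
    def __init__(self) -> None:
        self.counts = Counter()  # rule name -> number of violations seen
        self.level = {}          # rule name -> index into LEVELS

    def record(self, rule: str) -> str:
        """Count one violation; every third one raises the rule's level by one."""
        self.counts[rule] += 1
        current = self.level.get(rule, 0)
        if self.counts[rule] % THRESHOLD == 0:
            current = min(current + 1, len(LEVELS) - 1)
            self.level[rule] = current
        return LEVELS[current]
```

Tracking counts per rule keeps the harness from growing until a failure has actually repeated.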

Caution: More Harnessing Is Not Always Better

Research from ETH Zurich shows that LLM-generated configuration files degraded performance and consumed over 20% more tokens. Research from Chroma likewise confirms that model performance tends to decrease as context length increases. Aim for a harness that is sufficient, not excessive.

Conclusion

Harness engineering is essentially an iterative process of "observing failures and stacking up structural preventive measures." Reports on SWE-bench indicate that scaffolding design has a significant impact on results, which makes the harness an investment worth at least as much as model selection.

Detailed commentary on the topics introduced in this article is available in the ZenChAIne article.