Translated by AI
Introduction to Harness Engineering: 5 Elements to Structure AI Agent Quality
What is Harness Engineering?
Harness engineering is an environment design methodology that improves the output quality and reproducibility of AI agents using five elements: rules, skills, hooks, memory, and feedback. Following prompt engineering (how to ask) and context engineering (what to show), it focuses on designing "what kind of environment the agent works in." The term "harness engineering" was not coined by any one company; it emerged organically within the industry.
Reports on SWE-bench results (e.g., Particula.tech) indicate that, even with the same model, the pass rate fluctuates significantly depending on the scaffolding design alone. In other words, harness design matters as much as model selection, and possibly more.
Why CLAUDE.md Alone Is Not Enough
Even if you write "Please write tests before committing" in CLAUDE.md, it remains just a "request." As project scale grows, the following issues manifest:
- No enforcement of rule compliance (agents may ignore them)
- Memory is lost between sessions (everything must be explained from scratch each time)
- Instructions bloat as skills are added and become unmanageable
In harness engineering, hooks automatically execute tests before commits and block the commit if they fail. The essence is to transform "requests" into "enforcement."
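The difference between a request and enforcement can be illustrated with a small script. This is a minimal sketch, not a prescribed implementation: the function names `run_tests` and `pre_commit` are hypothetical, and it assumes the project's tests run with pytest. Git blocks a commit whenever its pre-commit hook exits with a non-zero status, which is the mechanism the sketch relies on.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_tests(cmd: Sequence[str]) -> int:
    """Run the test command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def pre_commit(
    cmd: Sequence[str],
    runner: Callable[[Sequence[str]], int] = run_tests,
) -> int:
    """Return 0 to allow the commit, 1 to block it."""
    if runner(cmd) != 0:
        print("Tests failed; commit blocked.", file=sys.stderr)
        return 1
    return 0


# Assumed wiring: a hook script would end with something like
#   sys.exit(pre_commit([sys.executable, "-m", "pytest", "-q"]))
```

Because the agent cannot complete the commit while the hook returns a non-zero code, compliance no longer depends on the agent choosing to follow the instruction.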
How Do the Three Evolutionary Stages Build on Each Other?
Each stage builds on the previous one rather than replacing it. One way to organize them:
Stage 1: Prompt Engineering (2022–2024)
→ "How to ask": Few-shot, CoT, Role-play
Stage 2: Context Engineering (2025–)
→ "What to show": RAG, CLAUDE.md, AGENTS.md
→ Karpathy: "Techniques to fill the context window with optimal information"
Stage 3: Harness Engineering (Late 2025–)
→ "What environment to work in": Rules + Skills + Hooks + Memory + Feedback
→ Adds enforcement + persistence + automated verification
What Are the 5 Components of a Harness?
Based on Martin Fowler's framework, we organize these along two axes: feedforward (guidance before action) and feedback (verification after action).
| Element | Function | Control Type | Implementation Example |
|---|---|---|---|
| Rules | Declaration of code of conduct | Feedforward | CLAUDE.md, .claude/rules/ |
| Skills | Reusable procedures | Feedforward | /write-test, SKILL.md |
| Hooks | Event-driven triggers | Feedback | auto-format on file save |
| Memory | Persistence across sessions | Feedforward | progress.md, Auto Memory |
| Feedback | Multi-layered automated verification | Feedback | lint → type check → test |
Appropriate Size for Rules
The HumanLayer blog recommends keeping CLAUDE.md under 60 lines. It is important to focus on project-specific rules rather than general best practices that the model already knows.
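A size budget like this can itself be checked mechanically. The following is a rough sketch under assumptions: the helper names are hypothetical, every line (including blank ones) counts toward the budget, and "under 60 lines" is read as strictly fewer than 60.

```python
from pathlib import Path

MAX_LINES = 60  # Budget taken from the recommendation quoted above.


def count_lines(text: str) -> int:
    """Count all lines, including blank ones (an assumption of this sketch)."""
    return len(text.splitlines())


def within_budget(text: str, max_lines: int = MAX_LINES) -> bool:
    """True when the text has strictly fewer lines than the budget."""
    return count_lines(text) < max_lines


def check_rules_file(path: Path, max_lines: int = MAX_LINES) -> bool:
    """Apply the budget to a rules file such as CLAUDE.md."""
    return within_budget(path.read_text(encoding="utf-8"), max_lines)
```

A check like `check_rules_file(Path("CLAUDE.md"))` could run in CI so that the file does not quietly grow past the budget.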
How to Implement Memory
There are currently three main approaches to maintaining memory across sessions:
- Write "Read progress.md" in CLAUDE.md: The simplest, but remains just a "request."
- Auto Memory (Claude Code): ~/.claude/projects/<project>/memory/MEMORY.md is automatically loaded each session. What is saved depends on Claude's judgment.
- Forced loading via external harness: An Anthropic research pattern where an orchestration layer automatically injects files.
Since "context compact" (context compression) can cause context loss, it is effective to write important decisions to external files like progress.md each time. Files are not compressed, so they can be re-read when needed.
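Writing decisions to an external file can be as simple as appending dated lines. This is a minimal sketch; the function names and the one-line-per-decision format are assumptions for illustration, not a format the tools require.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def record_decision(
    progress_file: Path, decision: str, today: Optional[date] = None
) -> None:
    """Append one dated decision to the progress file."""
    stamp = (today or date.today()).isoformat()
    with progress_file.open("a", encoding="utf-8") as f:
        f.write(f"- {stamp}: {decision}\n")


def load_progress(progress_file: Path) -> str:
    """Return the full progress log, or an empty string if none exists yet."""
    if not progress_file.exists():
        return ""
    return progress_file.read_text(encoding="utf-8")
```

Appending rather than rewriting keeps earlier decisions intact, so the file can be re-read in full after the context has been compacted.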
Two Types of Feedback
Martin Fowler distinguishes between the following:
- Computational control: Linters, tests. Deterministic and fast. Should be prioritized.
- Inferential control: LLM-as-Judge. Non-deterministic and slow. Use only at the final stage.
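The ordering above (cheap deterministic checks first, LLM-as-Judge last) can be sketched as a pipeline that stops at the first failure. The names `run_checks` and `Check` are hypothetical, and each check is represented by a plain callable rather than a real linter or model call.

```python
from typing import Callable, List, Tuple

# A check is a (name, callable) pair; the callable returns True on success.
Check = Tuple[str, Callable[[], bool]]


def run_checks(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run checks in order and stop at the first failure.

    Returns (all_passed, names_of_checks_that_ran).
    """
    ran: List[str] = []
    for name, check in checks:
        ran.append(name)
        if not check():
            return False, ran
    return True, ran
```

With the checks listed as lint, type check, test, and finally the LLM judge, a failing linter means the slow, non-deterministic judge is never invoked.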
Practical Lessons from OpenAI's Experiments
These are the experimental results (August 2025 – January 2026) from the OpenAI team led by Ryan Lopopolo.
Human-written code: 0 lines
Generated code: approx. 1 million lines
Merged PRs: approx. 1,500
Per engineer: 3.5 PRs/day
Token consumption: 1 billion+/day (Latent Space estimate: $2,000–$3,000/day)
"Fixing Quality Manually" Does Not Scale
The "AI Slop Friday" operation, where low-quality code is manually fixed every Friday, did not scale. The solution is to embed quality standards as automated check rules and have another AI agent automatically generate fix PRs.
Examples of the rules OpenAI built into its harness:
- Don't write the same logic repeatedly: Consolidate common logic into a team-shared library to prevent AI from reinventing it inconsistently.
- Always check at data entry points: Validate external data at the entrance so that you can trust that "only correct data is flowing" internally.
- Enforce unidirectional dependencies: Mechanically enforce through linters, for example, "UI can call Service, but not vice-versa."
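The unidirectional dependency rule is the kind of check a small custom linter can enforce. The sketch below is illustrative only: the layer names `ui` and `service`, the `ALLOWED` table, and `find_violations` are assumptions, and only top-level package names in import statements are inspected.

```python
import ast
from typing import Dict, List, Set

# Hypothetical layers: each layer may import only the layers listed for it.
ALLOWED: Dict[str, Set[str]] = {
    "ui": {"service"},
    "service": set(),
}


def find_violations(
    layer: str, source: str, allowed: Dict[str, Set[str]] = ALLOWED
) -> List[str]:
    """Report imports from `layer` into a layer it is not allowed to use."""
    violations: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            # Only known layers are policed; third-party imports are ignored.
            if top in allowed and top != layer and top not in allowed[layer]:
                violations.append(f"{layer} must not import {top}")
    return violations
```

Run over every file in a layer, a non-empty result can fail the build, which turns the architectural rule into a deterministic check rather than a guideline.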
Keep Instructions Short and Details in Separate Files
Limit AI instructions to a table of contents of roughly 100 lines and move details into docs/ or similar folders. AI cannot use information it hasn't read, so a structure that is easy to navigate is crucial. However, cramming everything into a single file results in information overload and has the opposite effect.
Anthropic's Harness Design Patterns
Pattern A: Two-Agent Configuration (November 2025)
When passing work between team members, the successor can work smoothly if the predecessor leaves notes and procedures. This is a division-of-labor pattern that applies this concept to AI agents.
Initialization Agent (executed only once at project start):
- Creates development server startup scripts (init.sh)
- Creates a work log (claude-progress.txt)
- Creates a feature requirement list (feature_list.json): a JSON file containing over 200 items detailing necessary functions, each with verification procedures and a completion flag.
Coding Agent (executed every time in subsequent sessions):
- Reads claude-progress.txt to grasp progress from previous sessions.
- Selects the highest priority incomplete feature from feature_list.json.
- Implements, verifies, commits, and updates the progress file and feature list.
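The coding agent's selection step can be sketched as follows. The JSON field names (`description`, `steps`, `priority`, `done`) are assumptions made for illustration; the source only states that each item has verification procedures and a completion flag. The sketch also assumes a lower `priority` number means higher priority.

```python
import json
from typing import Any, Dict, List, Optional

Feature = Dict[str, Any]


def next_feature(features: List[Feature]) -> Optional[Feature]:
    """Pick the incomplete feature with the highest priority (lowest number)."""
    pending = [f for f in features if not f["done"]]
    if not pending:
        return None
    return min(pending, key=lambda f: f["priority"])


def mark_done(features: List[Feature], description: str) -> None:
    """Set the completion flag on the feature with the given description."""
    for f in features:
        if f["description"] == description:
            f["done"] = True


# Hypothetical excerpt of feature_list.json.
SAMPLE = json.loads("""
[
  {"description": "Send a chat message", "steps": ["open app", "type", "send"],
   "priority": 1, "done": true},
  {"description": "Rename a conversation", "steps": ["open menu", "rename"],
   "priority": 3, "done": false},
  {"description": "Delete a conversation", "steps": ["open menu", "delete"],
   "priority": 2, "done": false}
]
""")
```

Because the state lives in the file rather than in the agent's context, each new session can call `next_feature` on the loaded list and continue where the previous one stopped.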
The two agents do not interact directly; they hand off work through files. A key characteristic of this pattern is that it does not create detailed specifications. The "description + verification steps" in the feature list effectively serve as the specification. Since Anthropic's experiment was a project to build a clone of Claude.ai, development progressed using the model's existing knowledge.
However, problems emerged. There were cases where the coding agent would mark a task as "complete" without checking the overall operation of the application. The next pattern was born to resolve these two issues: "too concise specifications" and "verification based on self-reporting."
Pattern B: Three-Agent Configuration — GAN-type (March 2026)
Based on the challenges of Pattern A, this configuration separates planning, development, and evaluation into three independent agents.
- Planner (Planning): Generates detailed specifications from concise instructions.
- Generator (Development): Implements based on specifications. Final judgment is deferred to the Evaluator.
- Evaluator (Evaluation): Tests by manipulating the app in a browser using Playwright MCP.
These three agents are not manually toggled by a human; they are automatically managed by an orchestrator using the Claude Agent SDK. The Claude Agent SDK allows the internal engine of the Claude Code CLI to be used as a Python/TS library, enabling multiple agents to be launched and linked automatically via the query() function. Note that this configuration can also be achieved without the SDK by using Claude Code's skill mechanism.
A distinctive feature is the "Sprint Contract," where the Generator and Evaluator agree on "what constitutes completion" before each sprint begins.
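The Generator/Evaluator loop around a Sprint Contract can be sketched in plain Python. This does not use the Claude Agent SDK: `SprintContract`, `run_sprint`, and the `generate`/`evaluate` callables are hypothetical stand-ins for the agents, and the retry limit is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SprintContract:
    """Completion criteria agreed on before the sprint starts."""
    criteria: List[str]


def run_sprint(
    contract: SprintContract,
    generate: Callable[[SprintContract, List[str]], str],
    evaluate: Callable[[SprintContract, str], List[str]],
    max_rounds: int = 3,
) -> bool:
    """Alternate generation and evaluation until no criteria remain unmet.

    `evaluate` returns the list of unmet criteria; an empty list means done.
    """
    feedback: List[str] = []
    for _ in range(max_rounds):
        artifact = generate(contract, feedback)
        feedback = evaluate(contract, artifact)
        if not feedback:
            return True
    return False
```

The point of the design is that completion is decided by the evaluator against criteria fixed in advance, not by the generator's own report.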
Comparison of results:
Simple harness: 20 minutes/$9 → Broken output
Full harness: 6 hours/$200 → Functional and sophisticated app
Practice: Implementation Steps to Start Tomorrow
Mitchell Hashimoto's rule: "If an agent makes a mistake, design a mechanism so that the mistake never happens again." However, optimizing the harness before any failure has been observed is counterproductive.
Day 1: Create CLAUDE.md (30 mins, project-specific rules only, keep it concise)
Week 1-2: Turn repetitive instructions into skills (/write-test, /review, etc.)
Week 2-4: Use hooks for auto-formatting, type checking, and test execution
Week 2-4: Write "Read/Update progress.md" in CLAUDE.md. Also utilize Auto Memory
Continuous: Same violation 3 times → Raise rule strictness by 1 level
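The "3 violations, then escalate" rule can be tracked with a small counter. This is a sketch under assumptions: the strictness ladder in `LEVELS` and the class name `ViolationTracker` are hypothetical, and the count resets after each escalation.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical strictness ladder, from weakest to strongest.
LEVELS: List[str] = ["rule in CLAUDE.md", "hook warning", "hook blocks the action"]
THRESHOLD = 3  # Same violation this many times raises strictness by one level.


class ViolationTracker:
    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.level: Dict[str, int] = {}

    def record(self, rule: str) -> str:
        """Record one violation and return the rule's current strictness."""
        self.counts[rule] += 1
        if self.counts[rule] >= THRESHOLD:
            self.counts[rule] = 0  # Start counting again at the new level.
            current = self.level.get(rule, 0)
            self.level[rule] = min(current + 1, len(LEVELS) - 1)
        return LEVELS[self.level.get(rule, 0)]
```

Escalating only after repeated failures keeps the harness from being tightened before there is evidence that a rule is actually being ignored.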
Caution: More Harnessing Is Not Always Better
Research from ETH Zurich shows that LLM-generated configuration files degraded performance and consumed over 20% more tokens. Furthermore, research from Chroma confirms a tendency for model performance to decrease as context length increases. Aim for a harness that is sufficient, not excessive.
Conclusion
Harness engineering is essentially an iterative process of "observing failures and stacking up structural preventive measures." Research regarding SWE-bench has shown that scaffolding design has a significant impact, making it a worthy investment comparable to or greater than model selection.
Reference Sources
- OpenAI — Harness engineering: leveraging Codex in an agent-first world
- Martin Fowler — Harness engineering for coding agent users
- Anthropic — Effective harnesses for long-running agents
- Anthropic — Harness design for long-running application development
- Particula.tech — Agent Scaffolding Beats Model Upgrades
- HumanLayer — Skill Issue: Harness Engineering
Detailed commentary on the topics introduced in this article is available in the ZenChAIne article.