Redesigning my Claude Code environment based on Anthropic's 'Harness Design' research


When you ask an AI, "Is this implementation okay?", it almost always replies, "Yes, it is implemented appropriately." Yet when you actually run the code, it often contains issues that break in production. Anthropic published a post on their official blog that explains why this happens and how to address it, backed by experimental data.

Conclusion

  1. Anthropic's research on "harness design," announced on their official blog, experimentally demonstrated that AI tends to evaluate its own output too leniently.
  2. The solution is to separate the AI that generates from the AI that evaluates. This structure can be replicated in Claude Code settings (via CLAUDE.md, skills, agents, and hooks).
  3. After reviewing my own environment, I realized I had "Generation" capabilities but almost no "Evaluation" mechanisms. I have since translated evaluation criteria into skills and rules.

This article introduces Anthropic's research and documents how I applied it to my Claude Code environment.


Anthropic's Experiment: The AI Self-Evaluation Problem and the 3-Agent Structure

In March 2026, Anthropic published "Harness design for long-running apps" on their official blog. It reports on an experiment where AI agents were given several hours to build apps autonomously. Here is a summary of the core findings:

When agents are asked to evaluate their own work, they tend to praise the result with confidence, even when the output is clearly mediocre to the human eye.

Even when the AI identifies a problem, it often judges it as "not a big deal" and approves it. Anthropic itself acknowledges that "Claude is insufficient as a QA agent in its initial state."

The solution built for this was a system consisting of three agents:

| Agent | Role | Responsibility |
| --- | --- | --- |
| Planner | Expands a 1-line prompt into a full product specification | Scope definition |
| Generator | Implements code based on the specification | Implementation |
| Evaluator | Operates the running app to find bugs and provides feedback based on scoring criteria | Quality Assurance |

In the solo run (no Evaluator), core game features failed to work, whereas with the full harness (Evaluator included), basic operations functioned. The Evaluator tested against 27 criteria and filed specific bug reports, such as "The fillRectangle function exists but is not triggered correctly on mouseUp."
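The control flow of such a three-agent setup can be sketched in a few lines of Python. This is only an illustration of the idea, not Anthropic's implementation: `run_harness`, `Review`, and the three stub functions are hypothetical names, and the stubs stand in for what would really be model calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Review:
    """Evaluator verdict: pass/fail plus concrete issues to feed back."""
    passed: bool
    issues: list[str] = field(default_factory=list)


def run_harness(
    prompt: str,
    plan: Callable[[str], str],
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str, str], Review],
    max_rounds: int = 3,
) -> tuple[str, Review]:
    """Plan once, then alternate generation and evaluation until the build passes."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    spec = plan(prompt)
    feedback: list[str] = []
    for _ in range(max_rounds):
        build = generate(spec, feedback)
        review = evaluate(spec, build)  # the evaluator never grades its own work
        if review.passed:
            break
        feedback = review.issues
    return build, review


# Hypothetical stubs standing in for the Planner, Generator, and Evaluator.
def plan(prompt: str) -> str:
    return f"SPEC: {prompt}"


def generate(spec: str, feedback: list[str]) -> str:
    return "build-v2" if feedback else "build-v1"


def evaluate(spec: str, build: str) -> Review:
    if build == "build-v1":
        return Review(False, ["fill tool is not triggered on mouseUp"])
    return Review(True)


build, review = run_harness("2D level editor", plan, generate, evaluate)
print(build, review.passed)  # build-v2 True
```

The point of the shape is that `generate` only sees the evaluator's issues, and `evaluate` is a separate function with its own criteria, so a lenient self-assessment cannot short-circuit the loop.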

Seeing this structure prompted me to review my own environment: I realized I could replicate it with Claude Code settings.


Rethinking Claude Code's Configuration as a "Harness"

If you map Claude Code's configuration to Anthropic's 3-agent structure, it looks like this:

| Anthropic Structure | Claude Code Mapping | Specific Settings |
| --- | --- | --- |
| Planner | CLAUDE.md + Planning Skills | Project overview, constraints, design decisions |
| Generator | Claude Code CLI + Technical Skills | Code generation, implementation |
| Evaluator | Review Agent + Rules + Hooks | Quality standards, automated checks |

Looking at this mapping, I realized that while I had well-maintained "Planner" and "Generator" components in my environment, there was almost no "Evaluator."

CLAUDE.md contained "what I am building." Skills contained "how it should be written." But there was no mechanism to "verify if what was written is truly correct." When I asked the AI, "Is this implementation okay?", it would only reply, "Yes, it is appropriate," just as Anthropic's experiment demonstrated.


Integrating Evaluators into Configuration: Three Layers

Inspired by Anthropic's research, I redesigned my Claude Code evaluation mechanism across three layers.

Layer 1: Rules (Evaluation criteria always applied)

The rule files placed in ~/.claude/rules/ are automatically loaded in every session. I write down "checks that are essential for quality, which the AI would not do naturally."

Example: Evaluation criteria for Supabase/PostgreSQL (subset of 30 rules)

# ~/.claude/rules/supabase-postgres.md (excerpt)

## Add indexes to FK columns
PostgreSQL does not automatically index FK columns.

## RLS performance optimization
Wrap auth.uid() in a SELECT (prevent per-row invocation):
- BAD: using (user_id = auth.uid())
- GOOD: using (user_id = (select auth.uid()))

## Cursor-based pagination
- BAD: OFFSET (slow on deep pages)
- GOOD: WHERE id > $last_id ORDER BY id LIMIT 20

The quality of the code the AI returns changes depending on whether these rules exist. RLS that doesn't wrap auth.uid() in a SELECT works while the table is small. Tests pass. But it slows down as records increase in production. This is the same structure reported in Anthropic's experiment, where "AI settles for superficial tests and overlooks deep bugs."
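To make the cursor-based pagination rule concrete, here is a minimal, runnable sketch. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs anywhere; the `items` table and the `fetch_page` helper are hypothetical names chosen for illustration.

```python
import sqlite3

PAGE_SIZE = 20


def fetch_page(conn: sqlite3.Connection, last_id: int = 0, limit: int = PAGE_SIZE):
    """Return the next page of rows whose id is greater than last_id."""
    return conn.execute(
        "SELECT id, title FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    ).fetchall()


# In-memory demo data: 50 rows with ids 1..50.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO items (title) VALUES (?)",
    [(f"item {i}",) for i in range(1, 51)],
)

first = fetch_page(conn)
# The cursor is simply the last id seen on the previous page.
second = fetch_page(conn, last_id=first[-1][0])
print(first[0][0], first[-1][0], second[0][0])  # 1 20 21
```

Because each page starts from an indexed `id` comparison rather than skipping rows with OFFSET, the cost of fetching a page does not grow with how deep the page is.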

Rules exist to "ensure the AI never steps on the same landmine twice," corresponding to the design of evaluation criteria in Anthropic's research.


Layer 2: Skills (Specialized evaluators activated only when needed)

Since rules are always loaded, adding too many of them puts pressure on the context window. I move evaluation criteria that are needed only for specific tasks into skills (~/.claude/skills/).

Normally, only each skill's name and description are kept in the context; the body is read only when necessary. You can also set skills up to auto-activate on keywords.

# ~/.claude/CLAUDE.md (Example of auto-activation settings)

## Automatic Skill Activation

### Test Creation / TDD
**Trigger:** test, TDD, coverage
**Action:** Execute test-driven-development skill

### Bug / Error Handling
**Trigger:** bug, error, debug, issue
**Action:** Execute systematic-debugging skill

### Completion Verification
**Trigger:** done, verify, review
**Action:** Execute verification-before-completion skill

The "Completion Verification" skill is important. In Anthropic's experiment, the evaluator performed checks at the end of each sprint. I replicate this by using a skill that auto-activates upon task completion. Just by saying "verify this," the quality checklist runs in the background.

I currently have 27 skills installed, 7 of which auto-activate via keywords.
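As a mental model of the trigger table in the CLAUDE.md excerpt, the mapping from keywords to skills can be sketched as below. This is an assumption-laden illustration, not how Claude Code works internally: the CLAUDE.md text is natural-language instruction that the model interprets, whereas this hypothetical `matching_skills` helper does plain whole-word matching. It can still be handy for spotting phrases that would plausibly trigger more than one skill.

```python
import re

# Keyword table copied from the CLAUDE.md excerpt; the matching logic is hypothetical.
TRIGGERS = {
    "test-driven-development": ["test", "TDD", "coverage"],
    "systematic-debugging": ["bug", "error", "debug", "issue"],
    "verification-before-completion": ["done", "verify", "review"],
}


def matching_skills(message: str) -> list[str]:
    """Return the skills whose trigger keywords appear as whole words in the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return [
        skill
        for skill, keywords in TRIGGERS.items()
        if any(keyword.lower() in words for keyword in keywords)
    ]


print(matching_skills("Please verify this change"))
# ['verification-before-completion']
print(matching_skills("Fix the login bug and add a test"))
# ['test-driven-development', 'systematic-debugging']
```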


Layer 3: Agent Separation (Treating generation and evaluation as distinct personas)

The core of Anthropic's research was "separating the creator from the evaluator." In Claude Code, you can replicate this through agent orchestration settings.

# ~/.claude/rules/agents.md

## Manager Role
You are a Manager and Agent Orchestrator.

**Absolute Rules:**
- Never implement anything yourself; delegate everything to a Sub Agent.
- Break down tasks extensively and build a PDCA cycle.

## Delegation timing to Sub Agent
### Cases to delegate immediately
1. Implementation tasks — code addition/modification
2. Debugging — error investigation/fixing
3. Test creation — Unit/E2E tests

### What the Manager should handle directly
1. Task breakdown — splitting large requests into smaller tasks
2. Progress confirmation — verifying Sub Agent results
3. Planning adjustments — adjusting plans if issues arise

The point is to fix the main Claude persona as the "Manager/Evaluator" and delegate implementation to a Sub Agent. This achieves the "separation of generation and evaluation" at the configuration level, as demonstrated in Anthropic's experiment.

The main Claude does not implement. It delegates implementation to the Sub Agent and then moves to the side of verifying the results. In terms of Anthropic's 3-agent structure, the main Claude acts as the Planner and Evaluator, while the Sub Agent acts as the Generator.


3 Principles for Designing Evaluation in Claude Code

While reviewing the entire harness, I extracted three design principles from Anthropic's article.

Principle 1: Weight criteria toward "things the AI is bad at"

In Anthropic's frontend experiment, they established four evaluation criteria (design quality, originality, technical aspects, and functionality). They reported that they "placed more weight on design quality and originality than on technical strength and functionality."

The reason is that the AI was already capable of producing high scores in technical and functional aspects. The weight was lowered for things the AI does naturally, and raised for things the AI struggles with.

Applying this to my environment looks like this:

  • Low weight (AI is good at): Syntactically correct code, basic API endpoints
  • High weight (AI is bad at): Performance traps (the RLS auth.uid() issue), UI design based on consumer psychology, security blind spots

The content written in rules and skills should also be designed around "things the AI overlooks," rather than "things the AI does naturally."
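The effect of weighting can be seen with a small weighted-average sketch. The weights and scores here are invented for illustration; they are not the values Anthropic used, and `weighted_score` is a hypothetical helper.

```python
# Hypothetical weights: criteria the AI tends to be weak at count three times as much.
WEIGHTS = {
    "design_quality": 3,
    "originality": 3,
    "technical": 1,
    "functionality": 1,
}


def weighted_score(scores: dict[str, int], weights: dict[str, int] = WEIGHTS) -> float:
    """Weighted average of per-criterion scores (each on a 0-10 scale)."""
    missing = weights.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# A build that is technically solid but visually generic (made-up scores).
scores = {"design_quality": 4, "originality": 3, "technical": 9, "functionality": 9}

plain_mean = sum(scores.values()) / len(scores)
print(plain_mean)              # 6.25
print(weighted_score(scores))  # 4.875
```

With an unweighted mean, the strong technical scores hide the weak design; with the weights shifted toward the weak areas, the same build scores clearly lower and would fail a threshold it previously cleared.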

Principle 2: Regularly review each component of the harness

There is a passage in Anthropic's article:

Every component of the harness encodes assumptions about what the model cannot do on its own, and these assumptions can be wrong and may quickly become obsolete as the model improves.

Indeed, when the model moved from Opus 4.5 to Opus 4.6, the sprint-splitting mechanism became unnecessary, because context anxiety (the tendency to cut tasks short as the context limit approaches) was almost entirely resolved in Opus 4.6.

The same happens with Claude Code settings. Rules that were necessary before may become unnecessary due to model improvements. Conversely, as models get smarter, new problems may emerge, requiring new rules.

In my case, I have revised CLAUDE.md eight times. At first, it had ballooned to over 200 lines, but I began trimming it with a single criterion: "Would the AI make a mistake if I deleted this line?" The same criterion is mentioned in the official documentation.
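A size budget like this is easy to check mechanically. The sketch below is one possible approach: it counts non-blank lines (an assumption of mine; counting every line works just as well), and the `instruction_lines` and `over_budget` helpers are hypothetical names.

```python
from pathlib import Path


def instruction_lines(text: str) -> list[str]:
    """Return the non-blank lines, i.e. the ones that actually carry instructions."""
    return [line for line in text.splitlines() if line.strip()]


def over_budget(text: str, budget: int = 200) -> int:
    """Return how many non-blank lines exceed the budget (0 if within it)."""
    return max(0, len(instruction_lines(text)) - budget)


if __name__ == "__main__":
    # Path taken from the article's setup; adjust for a project-level CLAUDE.md.
    path = Path.home() / ".claude" / "CLAUDE.md"
    if path.exists():
        excess = over_budget(path.read_text(encoding="utf-8"))
        if excess:
            print(f"{path}: {excess} lines over budget")
        else:
            print(f"{path}: within budget")
```

The script only tells you when to prune; deciding which lines to delete still comes down to asking whether the AI would make a mistake without them.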

Principle 3: As models get smarter, the potential of the harness expands

The space of interesting harness combinations does not shrink, but rather expands, as model accuracy improves.

This is Anthropic's conclusion. It is counterintuitive, but it matches my own experience.

As the AI gets smarter, the range of tasks you can entrust to it broadens. At the same time, the points where you must set evaluation criteria to avoid failure become more sophisticated and subtle. Simple rules (like "add indexes to FK columns") might eventually be handled naturally by the model. But judgments like "will users want to buy this design?" will remain necessary evaluation criteria.


Claude Code Harness Design: File Structure

Here is the final file structure of my configuration.

~/.claude/
├── CLAUDE.md               # Planner layer (Project overview + skill auto-activation settings)
├── rules/
│   ├── agents.md           # Agent separation (Separating generation and evaluation)
│   ├── supabase-postgres.md # Evaluation criteria (DB/RLS 30 rules)
│   ├── react-nextjs.md     # Evaluation criteria (React/Next.js)
│   ├── security.md         # Evaluation criteria (Security)
│   ├── coding-style.md     # Code quality standards
│   └── testing.md          # Test quality standards
└── skills/                  # Specialized evaluators activated only when needed
    ├── verification-before-completion  # Post-completion check
    ├── systematic-debugging            # Debugging check
    ├── test-driven-development         # TDD enforcement
    └── ... (27 in total)

Mapping to Anthropic's 3-agent structure:

  • CLAUDE.md → Planner output (equivalent to product specifications)
  • rules/ → Evaluation criteria (equivalent to sprint contracts)
  • skills/ → Specialized evaluators (activated when needed)
  • agents.md → Separation of generation and evaluation (fixing main Claude as the evaluator)

Summary: Harness Design Checklist

Based on Anthropic's research, here is a checklist for designing your Claude Code environment as a "harness."

Planner Layer (CLAUDE.md)

  • Contains project overview and constraints
  • Explains reasoning for design decisions that the AI cannot guess
  • Kept under 200 lines (beyond that, instructions are more likely to be ignored)

Evaluation Criteria Layer (rules/)

  • Contains rules that are essential for quality but which the AI does not do naturally
  • Prioritizes "things the AI overlooks" over "things the AI is good at"
  • Regularly removes unnecessary rules

Specialized Evaluator Layer (skills/)

  • Organized by technical domain (DB, React, security, etc.)
  • Has settings for auto-activation via keywords
  • Includes a skill for completion verification

Generation and Evaluation Separation (agents.md)

  • Main Claude does not implement, but delegates to a Sub Agent
  • Has a flow to verify implementation results
