River Review v0.30–v0.33: A Half-Month of Improvement Loops and Refining applyTo Scoping

This article is a reconstruction of the work logs for River Review v0.30.0 to v0.33.0 (as of 2026-05-08).

TL;DR

  • Over the half-month from v0.30.0 to v0.33.0, River Review was repositioned from an "agent that only executes reviews" to an "Improvement Loop OS": an operational foundation for "Review → Verify → Institutionalize" that verifies review results and feeds them back into test cases, suppression rules, and references.
  • Through Epic #743 (P1+P2), the responsibilities of the entry skill river-review (input classification, specialist skill selection, verification, feedback classification, and loop handover) were clearly defined.
  • New applyTo scoping rules (docs/development/skill-applyto-scoping.md) were established, and overly broad globs that caused planner false-positive routing were reorganized across 13 skills in 2 batches.
  • The planner-dataset eval (evaluation set for routing, coverage=1.0 / top1Match=1.0 / 23 cases) remained green throughout the entire period, with updates focused solely on increasing the ability to detect routing regressions.

What Changed (high-level)

Period: 2026-04-30 (v0.29.0, the baseline) to 2026-05-06 (v0.33.0), four releases in total.

| Release | Key Changes |
| --- | --- |
| v0.30.0 | Added rr-upstream-context-budget-tuning-001 skill (#736) |
| v0.31.0 | Epic #743 P1: Redefined river-review as an improvement-loop orchestrator + 3 references (VERIFICATION / FEEDBACK / IMPROVEMENT_LOOP) (#744 #745) |
| v0.32.0 | Epic #743 P2: routing/planner eval cases (#746) + feedback-to-fixture conversion workflow (#747) + suppression-feedback fixtures (#739) + eval-driven-skill-design skill (#737) |
| v0.33.0 | applyTo scoping rules + cleanup of 13 skills (Epic #762 / Implementation PRs #766 #767) |

In parallel, 4 dependabot PRs, Docusaurus version alignment, translations (#733 #734), and the addition of the code_search dependency (#738 PR-2) were also completed.

Epic #743: improvement-loop orchestrator

Before / After

Before: skills/agent-skills/river-review/SKILL.md was a "router that distributed tasks to specialist skills based on keywords." False positives and missed issues were fixed through ad-hoc prompt corrections, which left no trace in the repository.

After: The same SKILL.md now gives the entry skill the following six responsibilities, with the details delegated to reference files:

  1. Classify input intent — Determines the target category based on user intent / phase / artifact / risk
  2. Select specialist skills — Routing table and priority rules
  3. Create review execution plan — Collects artifacts according to input priority
  4. Verify findings — 6 self-check items in references/VERIFICATION.md
  5. Classify feedback — 7-type taxonomy in references/FEEDBACK.md
  6. Hand off learnings — 9-step loop in references/IMPROVEMENT_LOOP.md

The Role of the 3 References

| File | Role |
| --- | --- |
| VERIFICATION.md | Self-checks before outputting findings (6 items, such as whether evidence is linked to diffs, whether impact is concrete, and whether severity and confidence are calibrated) and rejection conditions |
| FEEDBACK.md | Classifies feedback into 7 types: accepted / false_positive / missed_issue / not_actionable / duplicate / accepted_risk / unclear. Provides a one-to-one mapping for where each should be funneled (fixture / suppression / reference / routing) |
| IMPROVEMENT_LOOP.md | 9-step loop: Route → Review → Verify → Classify → Patch One Thing → Add Fixture → Run Eval → Record Learning → Promote Rule |

In addition, FEEDBACK_TO_FIXTURE.md was added in P2 as a supplement. It is a single table that lists, for each feedback type, the primary destination, the secondary destination, the required eval command, and whether a rationale is required. It also documents the procedure for triaging a missed_issue into one of 3 root causes: routing miss, missing context, or weak instructions.

Why This Is Beneficial

  • Fixes no longer end at the prompt: every piece of feedback must result in an update to a fixture, reference, suppression, or routing configuration.
  • The HIGH_SEVERITY guard is stated consistently: docs and eval prompts now describe the same behavior, in which a suppressed major or critical finding resurfaces through the guard unless the suppression is classified as accepted_risk.
  • Routing regressions are detectable through planner-dataset: #746 added 3 cases ("architecture intent," "pre-mortem intent," and "multi-skill (security + observability)"), so routing is protected by coverage and top1Match.
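
The guard behavior described above can be pictured as a filter over suppressed findings. The following is a minimal sketch under stated assumptions: the Finding and SuppressionEntry shapes and the visibleFindings name are hypothetical and are not taken from the River Review codebase. Only the rule itself (major / critical findings resurface unless classified as accepted_risk) comes from the article.

```typescript
// Severity levels named in the article; the real schema may define more.
type Severity = "minor" | "major" | "critical";

// Hypothetical shapes, for illustration only.
interface Finding {
  id: string;
  severity: Severity;
}

interface SuppressionEntry {
  findingId: string;
  classification: string; // one of the FEEDBACK.md types, e.g. "accepted_risk"
}

function visibleFindings(
  findings: Finding[],
  suppressions: SuppressionEntry[],
): Finding[] {
  const byFinding = new Map<string, SuppressionEntry>();
  for (const entry of suppressions) byFinding.set(entry.findingId, entry);

  return findings.filter((finding) => {
    const entry = byFinding.get(finding.id);
    if (entry === undefined) return true; // never suppressed
    const highSeverity =
      finding.severity === "major" || finding.severity === "critical";
    // Guard: a high-severity finding resurfaces unless the risk was accepted.
    return highSeverity && entry.classification !== "accepted_risk";
  });
}
```

In this sketch a critical finding suppressed as false_positive stays visible, while a major finding suppressed as accepted_risk is hidden.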

eval-driven improvement loop

Where to Use npm run eval:fixtures / npm run planner:eval:dataset

Excerpt from the conversion table in FEEDBACK_TO_FIXTURE.md.

| Feedback Type | Primary Destination | Required Eval Command |
| --- | --- | --- |
| accepted | (None) | npm run eval:fixtures |
| false_positive | Guard fixture (<NN>-guard.md / *-should-not-detect) | npm run eval:fixtures + npm run eval:repo-context |
| missed_issue | Happy-path fixture (<NN>-happy.md / *-should-detect) | npm run eval:fixtures + npm run eval:repo-context + npm run planner:eval:dataset |
| not_actionable | Fix template in reference / add example | npm run skills:validate |
| duplicate | Update routing (clarify owner skill) or logic within skill | npm run planner:eval:dataset + npm run skills:validate |
| accepted_risk | Suppression entry (rationale required) | npm run skills:validate |
| unclear | Improve wording in skill SKILL.md / reference | npm run skills:validate |
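
The conversion table can also be read as a lookup from feedback type to the commands that must pass. The sketch below encodes the table's third column as data. The REQUIRED_EVAL and commandsFor names are hypothetical, and nothing here is claimed to exist in the repository. Only the feedback types and command strings are taken from the table.

```typescript
type FeedbackType =
  | "accepted"
  | "false_positive"
  | "missed_issue"
  | "not_actionable"
  | "duplicate"
  | "accepted_risk"
  | "unclear";

// "Required Eval Command" column of the table, one entry per feedback type.
const REQUIRED_EVAL: Record<FeedbackType, readonly string[]> = {
  accepted: ["npm run eval:fixtures"],
  false_positive: ["npm run eval:fixtures", "npm run eval:repo-context"],
  missed_issue: [
    "npm run eval:fixtures",
    "npm run eval:repo-context",
    "npm run planner:eval:dataset",
  ],
  not_actionable: ["npm run skills:validate"],
  duplicate: ["npm run planner:eval:dataset", "npm run skills:validate"],
  accepted_risk: ["npm run skills:validate"],
  unclear: ["npm run skills:validate"],
};

// Collects the de-duplicated commands for a batch of classified feedback,
// preserving the order in which they are first required.
function commandsFor(types: FeedbackType[]): string[] {
  const commands: string[] = [];
  for (const type of types) {
    for (const command of REQUIRED_EVAL[type]) {
      if (commands.indexOf(command) === -1) commands.push(command);
    }
  }
  return commands;
}

console.log(commandsFor(["false_positive", "duplicate"]));
```

Typing the map as Record<FeedbackType, ...> makes the compiler reject a table that leaves one of the seven types unmapped.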

We made it standard practice to confirm locally that each eval command exits with code 0 before pushing. Not relying on CI alone is proving effective.

New Skill rr-upstream-eval-driven-skill-design-001 (#737)

An upstream skill that runs when a PR adds a new skills/**/SKILL.md. It checks that the fixtures/ happy-path × guard pair and the eval/ wiring (e.g., promptfoo.yaml or cases.json) are present. If they are, it stays silent; if anything is missing, it issues a minor finding and guides the author to wire up npm run eval:fixtures / npm run eval:repo-context. Because the Pre-execution Gate reacts only to a newly added SKILL.md, it does not trigger for PRs that edit existing skills.
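
The skill itself is prompt-driven, but its gate and check can be illustrated as plain logic. The sketch below assumes that fixtures/ and eval/ sit directly under the new skill's directory and that fixtures follow the <NN>-happy.md / <NN>-guard.md naming from the conversion table. The ChangedFile and GateFinding shapes and the checkNewSkills name are hypothetical.

```typescript
// Hypothetical shapes, for illustration only.
interface ChangedFile {
  path: string;
  change: "added" | "modified" | "removed";
}

interface GateFinding {
  skillDir: string;
  severity: "minor";
  missing: string[];
}

const NEW_SKILL = /^skills\/(?:.+\/)?SKILL\.md$/;

function checkNewSkills(
  changed: ChangedFile[],
  repoPaths: string[],
): GateFinding[] {
  const findings: GateFinding[] = [];
  for (const file of changed) {
    // Pre-execution gate: only a newly added SKILL.md triggers the check.
    if (file.change !== "added" || !NEW_SKILL.test(file.path)) continue;

    // Directory of the skill, keeping the trailing "/".
    const skillDir = file.path.slice(0, file.path.length - "SKILL.md".length);
    const local = repoPaths.filter((p) => p.indexOf(skillDir) === 0);

    const missing: string[] = [];
    if (!local.some((p) => /\/fixtures\/[^/]*-happy\.md$/.test(p))) {
      missing.push("happy-path fixture");
    }
    if (!local.some((p) => /\/fixtures\/[^/]*-guard\.md$/.test(p))) {
      missing.push("guard fixture");
    }
    if (!local.some((p) => /\/eval\/(?:promptfoo\.yaml|cases\.json)$/.test(p))) {
      missing.push("eval wiring");
    }
    // Stay silent when everything is present; otherwise emit one minor finding.
    if (missing.length > 0) findings.push({ skillDir, severity: "minor", missing });
  }
  return findings;
}
```

An edit to an existing SKILL.md passes through untouched because the loop skips every file whose change is not "added".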

applyTo scoping rules (#762)

What Was the Problem?

  • Skills with bare extension globs such as applyTo: ['**/*.ts', '**/*.tsx'] were firing on files that were never meant to be in their scope, such as test files under tests/ and *.config.ts files.
  • This manifested as planner false-positive routing: tokens were consumed by the prompt, but the output often became noise due to domain mismatch.

New Rules (docs/development/skill-applyto-scoping.md)

Defines 3 cases where applyTo is considered "over-broad":

  1. The pattern is unconstrained ('**/*') and the skill is not meta / process / sample.
  2. The pattern is bound only by extension, but the skill's review domain is stream-specific (upstream, midstream, or downstream).
  3. The pattern matches files outside the skill's domain in a typical project layout (e.g., a midstream code-quality skill matching tests/**).
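
The first two cases can be checked from the pattern text alone. The third needs sample paths from a real project layout. The sketch below covers only cases 1 and 2. The SkillMeta shape, the idea that exempt skills are identified by tags, and the overBroadReasons name are assumptions made for illustration, not the repository's audit script.

```typescript
// Hypothetical metadata shape, for illustration only.
interface SkillMeta {
  id: string;
  applyTo: string[];
  // Stream the skill's review domain belongs to; undefined if not stream-specific.
  phase?: "upstream" | "midstream" | "downstream";
  tags: string[];
}

// Case 1 exemptions named in the rule: meta / process / sample skills.
const EXEMPT_TAGS = ["meta", "process", "sample"];

// Matches patterns bound only by extension, e.g. "**/*.ts" or "**/*.{yaml,yml}".
const EXTENSION_ONLY = /^\*\*\/\*\.(?:\w+|\{[\w,]+\})$/;

function overBroadReasons(skill: SkillMeta): string[] {
  const reasons: string[] = [];
  const exempt = skill.tags.some((tag) => EXEMPT_TAGS.indexOf(tag) !== -1);
  for (const pattern of skill.applyTo) {
    if (pattern === "**/*" && !exempt) {
      reasons.push(pattern + ": unconstrained pattern on a non-exempt skill");
    } else if (EXTENSION_ONLY.test(pattern) && skill.phase !== undefined) {
      reasons.push(pattern + ": extension-only pattern on a " + skill.phase + " skill");
    }
  }
  return reasons;
}
```

A dir-bounded pattern such as src/**/*.{ts,tsx} produces no reasons here, which is the state the cleanup aimed for.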

Recommended applyTo by phase:

  • upstream: docs/architecture/**/*.md, docs/**/*architecture*.md, docs/**/*design*.md, **/*.adr, etc.
  • midstream: src/**/*.{ts,tsx}, app/**/*.{ts,tsx}, lib/**/*.{ts,tsx}, packages/**/*.{ts,tsx} (per extension)
  • downstream: tests/**/*.{ts,tsx,js,jsx}, __tests__/**/*.{ts,tsx,js,jsx}, **/*.test.{ts,tsx,js,jsx}, **/*.spec.{ts,tsx,js,jsx}
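
To see why the dir-bounded patterns stop the false-positive routing, it helps to run both styles against a few paths. The matcher below is a deliberately simplified sketch that supports only **, * and {a,b}. It is not the glob implementation the planner uses, and the sample paths are made up.

```typescript
function escapeRegExp(text: string): string {
  return text.replace(/[.+^${}()|[\]\\?]/g, "\\$&");
}

// Simplified glob-to-RegExp conversion: "**/" spans any number of directories,
// "*" stays within one path segment, "{a,b}" is an alternation.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    const ch = glob.charAt(i);
    const close = ch === "{" ? glob.indexOf("}", i) : -1;
    if (glob.slice(i, i + 3) === "**/") {
      source += "(?:.*/)?";
      i += 3;
    } else if (glob.slice(i, i + 2) === "**") {
      source += ".*";
      i += 2;
    } else if (ch === "*") {
      source += "[^/]*";
      i += 1;
    } else if (close !== -1) {
      const alternatives = glob.slice(i + 1, close).split(",").map(escapeRegExp);
      source += "(?:" + alternatives.join("|") + ")";
      i = close + 1;
    } else {
      source += escapeRegExp(ch);
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

function matchesAny(patterns: string[], path: string): boolean {
  return patterns.some((pattern) => globToRegExp(pattern).test(path));
}

const broadPatterns = ["**/*.ts", "**/*.tsx"];
const midstreamPatterns = [
  "src/**/*.{ts,tsx}",
  "app/**/*.{ts,tsx}",
  "lib/**/*.{ts,tsx}",
  "packages/**/*.{ts,tsx}",
];

for (const path of ["src/api/user.ts", "tests/unit/user.test.ts", "vite.config.ts"]) {
  console.log(path, matchesAny(broadPatterns, path), matchesAny(midstreamPatterns, path));
}
```

With these inputs the broad patterns match all three paths, while the midstream patterns match only src/api/user.ts.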

Results (13 skills)

Implementation for Epic #762 was split into two PRs.

  • Batch 1 (#766, 8 midstream skills) — Replaced **/*.ts / **/*.tsx with dir-bounded patterns for src|app|lib|packages
  • Batch 2 (#767, 5 upstream skills) — Replaced **/*.md / **/*.{yaml,yml,json} with dir-bounded patterns for docs|pages|specs|design|architecture

Planner-dataset eval maintained 23 cases / coverage=1.0 / top1Match=1.0 throughout the entire period.
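
The article does not define the two metrics, so the sketch below rests on assumed definitions: coverage is the share of expected skills that appear anywhere in the planner's selection, and top1Match is the share of cases whose top-ranked selection equals the first expected skill. The PlannerCaseResult shape, the function names, and the skill ids in the example are hypothetical.

```typescript
// Hypothetical result shape, for illustration only.
interface PlannerCaseResult {
  caseId: string;
  expected: string[]; // expected skill ids, most important first
  selected: string[]; // planner output, ranked
}

interface PlannerMetrics {
  coverage: number;
  top1Match: number;
}

function scorePlanner(results: PlannerCaseResult[]): PlannerMetrics {
  let expectedTotal = 0;
  let expectedFound = 0;
  let top1Hits = 0;
  for (const result of results) {
    expectedTotal += result.expected.length;
    expectedFound += result.expected.filter(
      (id) => result.selected.indexOf(id) !== -1,
    ).length;
    if (result.selected.length > 0 && result.selected[0] === result.expected[0]) {
      top1Hits += 1;
    }
  }
  return {
    coverage: expectedTotal === 0 ? 0 : expectedFound / expectedTotal,
    top1Match: results.length === 0 ? 0 : top1Hits / results.length,
  };
}

// Merge gate: any routing regression drops one metric below 1.0 and fails.
function passesMergeGate(metrics: PlannerMetrics): boolean {
  return metrics.coverage === 1 && metrics.top1Match === 1;
}

const regressed = scorePlanner([
  { caseId: "single-skill", expected: ["skill-a", "skill-b"], selected: ["skill-a", "skill-b"] },
  { caseId: "multi-skill", expected: ["skill-security", "skill-observability"], selected: ["skill-observability"] },
]);
console.log(regressed, passesMergeGate(regressed));
```

Under these definitions, narrowing an applyTo pattern so far that an expected skill stops being selected shows up immediately as coverage below 1.0.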

The initial audit estimate of "50 over-broad" skills diverged from the actual count of 13. The main reason discovered during measurement was that skills falling under excludedTags (sample / hello / policy / process / routing) were already being excluded by the planner and had no impact.

Learnings

  • Institutionalize vs. Prompt: as long as every piece of feedback is converted into a fixture, suppression, reference, or routing update, the same prompt corrections do not have to be made again.
  • Verify planning numbers with actual measurements: the audit's estimate of "50 over-broad" skills turned out to be 13 once excludedTags were taken into account. It is faster to measure on the implementation side before committing to numbers in a plan.
  • Use planner-eval as a guard: with coverage=1.0 / top1Match=1.0 as a merge gate, routing regressions are stopped mechanically even while scoping is being narrowed.
  • release-please leaves chore commits out of version bumps: the version is expected to stay put across consecutive merges of docs/chore PRs. This period included several feature PRs, so it ended with 4 releases.
