Translated by AI
Gradient Between LLM Workflow and ReAct Quadrants: Design and Operational Phases in Skill Engineering
Series Position: The 4th installment in the ReAct Agent application series. The quadrant names align with previous works and the AAP repository.
Premise — Revisiting the 4 Quadrants
I will restate the 4 quadrants established in the first article.
- (1) Script Quadrant — Deterministic × Definable. Handled by scripts/pipelines.
- (2) Algorithmic Search Quadrant — Deterministic × Exploratory. The domain of classic AI / OR.
- (3) LLM Workflow Quadrant — Semantic Judgment × Definable. Invoking LLMs within a pre-defined workflow.
- (4) Autonomous Agentic Loop Quadrant — Semantic Judgment × Exploratory. An autonomous loop where the LLM decides the next action (i.e., ReAct agent).
Up until the previous entries, I have treated (3) and (4) as a dichotomy where one must be chosen based on the nature of the task. The second installment pointed out that the industry lacks established vocabulary for (3), and the third installment introduced "Phase Separation" between the design phase and the operation phase, viewing this dichotomy through the lens of business systems. This article aims to bring that Phase Separation down to the resolution of individual skill design, showing that there is a continuous gradient between the two poles.
Observation — The Same Work Living in Multiple Forms
Looking through my repositories, functions performing conceptually identical work are implemented in multiple different forms. The Promote phase within the AKC (Agent Knowledge Cycle)—the 6-phase cycle where an agent distills knowledge from its own output for self-improvement—is a prime example. This involves extracting recurring principles from skills and promoting them to rules. Generalized, it is the task of "extracting principles from repetition." Mining alert rules from logs, finding common patterns in a codebase for refactoring, or compiling FAQs from customer support conversation logs all belong to the same family.
contemplative-agent contains a Python module, core/rules_distill.py. It vectorizes skills via embeddings, creates thematic clusters with cluster_patterns, and builds LLM batches with MAX_RULES_BATCH=10. The cluster threshold, CLUSTER_THRESHOLD_RULES=0.65, is determined via a calibration file. The LLM call is just one step in the pipeline, and most decisions are frozen into code beforehand.
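To make the shape of such a frozen pipeline concrete, here is a minimal sketch. Only the two constant names and their values come from the description above. The greedy cosine clustering, the helper names (greedy_cluster, chunk, distill_rules), and the idea that a batch holds up to MAX_RULES_BATCH clusters are assumptions made for illustration, not the actual contents of core/rules_distill.py.

```python
import math
from typing import Callable, Sequence

# Constant names and values are quoted from the article. Every function below
# is a hypothetical stand-in, not code from contemplative-agent.
MAX_RULES_BATCH = 10
CLUSTER_THRESHOLD_RULES = 0.65

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_cluster(vectors: Sequence[Vector],
                   threshold: float = CLUSTER_THRESHOLD_RULES) -> list[list[int]]:
    """Assign each vector to the first cluster whose seed is similar enough."""
    clusters: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vectors[cluster[0]], vec) >= threshold:
                cluster.append(i)
                break
        else:  # no existing cluster matched, so start a new one
            clusters.append([i])
    return clusters


def chunk(items: list, size: int = MAX_RULES_BATCH) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def distill_rules(skills: Sequence[str],
                  embed: Callable[[str], Vector],
                  generate_rules: Callable[[list[list[str]]], list[str]]) -> list[str]:
    """Fixed pipeline: embed, cluster, batch, then one LLM call per batch."""
    vectors = [embed(s) for s in skills]
    clusters = [[skills[i] for i in c] for c in greedy_cluster(vectors)]
    rules: list[str] = []
    for batch in chunk(clusters):
        rules.extend(generate_rules(batch))  # the only step that touches the LLM
    return rules
```

In this sketch the LLM appears only behind generate_rules. Cluster membership, the threshold, and the batch size are all decided before the model is ever called.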
On the other hand, the rules-distill skill (~/.claude/skills/rules-distill/SKILL.md) is written in natural language. In Phase 1, scan-skills.sh creates a file list, in Phase 2, a sub-agent is delegated the judgment of thematic clustering, and cross-batch merging is also decided by LLM judgment. The script is used only for list creation and persistence; the core judgment resides in the runtime LLM.
Both perform the same work. Yet, the implementations are drastically different.
Surface Hypothesis — Model Capability
The first explanation that comes to mind is model capability. core/rules_distill.py was written to run in a 9B local model environment. What I observed while implementing it was that thematic cluster judgment is difficult to run robustly at runtime with a 9B model. The scaffolding of embeddings + clustering + threshold adjustment therefore limits the 9B model's role to "generating a rule candidate from the given clusters." Conversely, in a Claude Code environment where Opus can be called at runtime, that scaffolding becomes unnecessary. The LLM can handle cluster judgment and merge judgment while considering the entire context.
This aligns with the design principle recorded in AKC—Design Principle 5 of the README: "Evaluation scales with model capability—small models operate via rubric evaluation, while Opus-class models operate via qualitative evaluation considering the full context." Although the original text focuses on the context of evaluation, if one extends the idea of changing the judgment scale according to capability to the choice of implementation placement, the Python pipeline of the 9B era and the natural language skill of the Opus era can be read as the two endpoints of that extension.
However, this explanation only brushes the surface of implementation placement. The difference in capability is indeed an observed fact, but it is a downstream result of a deeper decision.
Deep Hypothesis — AAP Phase
Let us return to the distinction between the design phase and the operation phase described in the previous article, "Do ReAct Agents Need to be in Production?". It is this phase axis that truly drives the implementation placement.
contemplative-agent is positioned in the operation phase. Once deployed, inputs flow through fixed paths. The boundaries between input (persona templates, dialogue, memory) and output (responses, logs) are determined in advance, and the internal knowledge cycle also runs according to the predefined 6 phases. Because the pipeline is frozen in the operation phase, structures like the embedding / clustering / threshold / batch steps in core/rules_distill.py can be baked into Python. The 9B model is a choice this environment makes possible, not the cause of the structure.
In fact, the contemplative-agent pipeline is built so that the LLM call can optionally be switched to Claude or GPT models. Even when capability is scaled up, the pipeline structure around the LLM call does not fundamentally break. This serves as counterevidence to the surface hypothesis (that limited capability is what requires the pipeline structure). If capability were the cause, the pipeline structure should become redundant and collapse the moment the model is replaced with an Opus-class one. In reality, it does not. It is the phase decision (the operation phase) that demands the pipeline structure, not capability. Capability is merely a consequence chosen downstream of the phase decision.
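The point that the pipeline structure survives a model swap can be sketched with a simple injected backend. The LLMBackend type, the prompt wording, and both stand-in model functions below are hypothetical. They illustrate the argument and do not reproduce how contemplative-agent actually switches models.

```python
from typing import Callable, Sequence

# Hypothetical interface: a backend maps a prompt string to a completion string.
LLMBackend = Callable[[str], str]


def build_prompt(cluster: Sequence[str]) -> str:
    """Fixed prompt template; frozen at design time, independent of the model."""
    return "Extract one rule shared by these skills:\n" + "\n".join(cluster)


def run_pipeline(clusters: Sequence[Sequence[str]], llm: LLMBackend) -> list[str]:
    """Same steps in the same order, regardless of which backend is injected."""
    return [llm(build_prompt(cluster)) for cluster in clusters]


# Stand-in backends for illustration only; real ones would call a model API.
def small_local_model(prompt: str) -> str:
    return f"[small] rule from {len(prompt.splitlines()) - 1} skills"


def large_hosted_model(prompt: str) -> str:
    return f"[large] rule from {len(prompt.splitlines()) - 1} skills"
```

Swapping small_local_model for large_hosted_model changes the quality of each answer but leaves run_pipeline untouched, which is the sense in which the structure follows from the phase rather than from capability.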
Claude Code is a tool that resides in the design phase. Each session itself is an exploratory task, and the target codebase, the IDE, the agents used, and the file types handled (.py / .ts / .md / .ipynb / etc.) shift per session. If a skill pre-determines paths or file structures, it will break the moment the environment moves. Therefore, the skill body is written in natural language to adapt to the runtime context. Opus is the consequence that supports the overall judgment necessary for that runtime adaptation, not the cause.
In other words:
- contemplative-agent can be written in the core/rules_distill.py format = Because the pipeline was frozen in the operation phase
- Claude Code must be written in the rules-distill skill format = Because the environment cannot be frozen in the design phase
Capability appears to be downstream of this phase decision. The phase draws the boundary between "hard-codeable" and "flexibility required," and the choice of capability is established within that boundary. The extension of AKC's "Capability ↑ → Overall judgment OK" can be read as a secondary principle that holds when the phase decision is given.
Within the Phase — Task Nature Shifts Position Even in the Same Phase
Even focusing on the design phase, implementation placement further branches out.
There is a skill called skill-comply. It is a skill that measures whether Claude's skills, rules, and agent definitions are actually being followed at runtime. It automatically generates scenarios with three levels of prompt intensity (supportive → neutral → competing), runs claude -p, classifies trace calls, and reports compliance rates. The directory contains pyproject.toml / prompts/ / fixtures/ / tests/ / scripts/ / results/. A requirement of the task is that scenario execution must be reproducible, otherwise the compliance rate has no meaning. Generalized, it is a task similar to static code analysis (linting or SonarQube) or automated benchmarking—it is required that the same input returns the same judgment. Although it is a skill within the design phase, the task nature of evaluation and measurement requires reproducibility.
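The reproducibility requirement can be illustrated by treating the report as a pure function of classified run results. This is a minimal sketch under stated assumptions: the three intensity names come from the text above, while the ScenarioResult record and the compliance_rates function are hypothetical and are not taken from skill-comply itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# The three intensity levels are named in the article; the record shape is assumed.
INTENSITIES = ("supportive", "neutral", "competing")


@dataclass(frozen=True)
class ScenarioResult:
    scenario: str
    intensity: str
    complied: bool  # outcome of classifying the trace for this run


def compliance_rates(results: Iterable[ScenarioResult]) -> dict[str, float]:
    """Compliance rate per intensity level, as a pure function of its input."""
    total: Counter = Counter()
    passed: Counter = Counter()
    for r in results:
        if r.intensity not in INTENSITIES:
            raise ValueError(f"unknown intensity: {r.intensity}")
        total[r.intensity] += 1
        passed[r.intensity] += int(r.complied)
    return {lvl: passed[lvl] / total[lvl] for lvl in INTENSITIES if total[lvl]}
```

Because the aggregation involves no model call, the same set of classified traces always yields the same rates. Any run-to-run variation is confined to the scenario execution step.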
On the other hand, the search-first skill searches for existing libraries and tools before a new implementation begins, and its directory contains only one file: SKILL.md. It delegates to a scout agent and returns an Adopt/Extend/Build judgment. Generalized, it resembles an engineer scanning GitHub or PyPI before building a new feature, narrowing down candidate libraries and deciding from context whether to Adopt or Build. Although it is also a skill within the design phase, its task nature of selection and judgment calls for contextual judgment rather than reproducibility: it is not a problem if it returns a slightly different set of candidates each time.
Both are in the design phase and operate in the same environment of Opus + open environment. Nevertheless, their implementation placements divide into the workflow side (skill-comply) and the ReAct side (search-first). This shows that even after the phase is decided, the task nature independently pulls the placement.
However, phases and tasks are not completely orthogonal. There is also an aspect where the phase changes how the task appears, such as tasks with high reproducibility requirements in the operation phase being strongly pushed toward the workflow side. This article treats the two axes roughly, but examining the magnitude of their interaction in detail is beyond the scope of this work.
The Same Phase Axis Descends into the Skill Internals
The same phase axis descends not only to the skill unit but also to the sub-component units within a skill. The contrast between skill-stocktake and context-sync illustrates this.
skill-stocktake is a skill that lists installed Claude skills and commands to audit their quality. It hard-codes the domain definition of "which skill files to evaluate" (~/.claude/skills/ and {cwd}/.claude/skills/) in a script. Generalized, it is akin to SBOM (Software Bill of Materials) generation or dependency scanning, where the requirement is to "mechanically enumerate targets." The advantage is that there are no omissions. The disadvantage is that it presumes the Claude Code environment and will not run as-is in other coding agent environments. Generalizing it would require a separate script for each agent, increasing maintenance costs.
context-sync is a skill that detects and corrects role duplication, obsolescence, and omissions in project documentation (CLAUDE.md / CODEMAPS / ADR / README, etc.). Unlike skill-stocktake, it leaves the domain definition to the overall judgment of the LLM. Generalized, it is similar to a tech lead reading through a project's architecture and judging, "This explanation is old, this part should be moved to an ADR." As a result, if it finds an llms.txt it includes that file in the synchronization targets, and it recognizes Codemaps, ADRs, and AGENTS.md (files that perform the same function under different names). The advantage is cross-environment portability. The disadvantage is that the detection range fluctuates with each execution, and once the codebase grows to full scale, it cannot fully capture the differences even in documents it does detect.
| Comparison Item | skill-stocktake | context-sync |
|---|---|---|
| Location of Domain Definition | Script-fixed | LLM judgment |
| Coverage | No omissions | Fluctuating |
| Portability | Single environment premise | Cross-environment |
| Scale Tolerance | High | Low |
"Sub-components" referred to here are the processing steps within a skill. One skill is one job from the outside, but it is divided into multiple steps internally. Taking skill-stocktake as an example, it can be divided into an object enumeration step (A) that decides "which skill files to evaluate" and a quality judgment step (B) that "judges the quality of each skill." context-sync is isomorphic, divided into two steps: (A) "which documents to target for sync" (object enumeration) and (B) "judging the age of content and parts that should be moved" (quality judgment). The sub-components referred to in this work point to steps within a skill such as these A and B.
This shows that the phase axis descends not only to the skill unit but also to the sub-component unit within a skill. Even with two skills belonging to the same phase, their placements for each step are different. skill-stocktake is a hybrid placement that puts A (object enumeration) in a script and B (quality judgment) in the LLM. context-sync is a ReAct-leaning placement that puts both A and B in the LLM. Even within the same skill, one part can be divided to the workflow side, and another part to the ReAct side.
Both take different placements within the design phase because there is a difference in the identifiability of the target. The design phase is characterized by "the environment moves," but the movement differs depending on the target. The skills of Claude, which are the targets of skill-stocktake, are placed in fixed locations and fixed naming conventions: ~/.claude/skills/ (global) and {cwd}/.claude/skills/ (project). Even in the design phase, paths can be hard-coded in scripts. On the other hand, the targets of context-sync (CLAUDE.md / CODEMAPS/ / docs/adr/ / AGENTS.md / llms.txt ...) have different placements and names depending on the codebase. Since they cannot be written out in a script, there is no choice but to leave it to the LLM's overall judgment.
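The hybrid placement described for skill-stocktake can be sketched as follows: step A is a deterministic script over the two fixed roots, and step B is handed to an injected judge. The function names, the assumption of one SKILL.md directly under each skill directory, and the judge callable are all hypothetical. This is not the skill's actual script.

```python
from pathlib import Path
from typing import Callable, Iterable


def skill_roots(home: Path, cwd: Path) -> list[Path]:
    """The two fixed locations named in the article: global and project."""
    return [home / ".claude" / "skills", cwd / ".claude" / "skills"]


def enumerate_skill_files(roots: Iterable[Path]) -> list[Path]:
    """Step A (script side): mechanical enumeration under hard-coded roots.

    Assumes one SKILL.md per skill directory, directly under each root.
    """
    found: list[Path] = []
    for root in roots:
        if root.is_dir():
            found.extend(sorted(root.glob("*/SKILL.md")))
    return found


def audit_skills(files: Iterable[Path],
                 judge: Callable[[str], str]) -> dict[str, str]:
    """Step B (LLM side): `judge` is a placeholder for a runtime LLM call."""
    return {str(p): judge(p.read_text(encoding="utf-8")) for p in files}
```

A caller would pass skill_roots(Path.home(), Path.cwd()) to step A. No equivalent of enumerate_skill_files can be written for context-sync, because its targets have no fixed roots or names to glob against.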
Furthermore, choosing the LLM judgment side comes with a scale-tolerance limit. The LLM's overall judgment can cover codebases up to medium size, but on a full-scale codebase it starts to miss differences. This omission remains even if capability increases sufficiently. The AKC design principle (Capability ↑ → Overall judgment OK) covers neither target identifiability nor scale tolerance.
Conclusion
The (3) LLM Workflow Quadrant and (4) Autonomous Agentic Loop Quadrant are useful as axes for classifying the nature of tasks. However, when considering individual skill design, there is a continuous gradation between both quadrants, and where one lands on that gradation appears to be primarily determined by the phase decision. The difference in capability is merely a result observed downstream of the phase decision.
As pointed out in the second installment, the industry has not yet established a vocabulary for speaking of (3) as an independent quadrant. This work attempts to step further into the continuous gradation within that dichotomy. The same job is implemented in different positions in the operation phase and the design phase. Even in the same phase, if the task nature is different, the position moves. Even within a single skill, the position is divided for each sub-component.
Although the Phase Separation of the previous work was written as an axis for business systems, this work attempted to show that the axis penetrates down to the resolution of individual skill design. The same phase axis primarily determines both the division of business and the implementation placement of skills.
One observation remains as a direction for future work: the optimal position shifts when the phase changes. An implementation position that was correct in one environment may no longer be optimal after the phase changes. I do not yet have an answer for how to build this into an implementation, and I leave it as a design challenge beyond the scope of this work.
Related
- Previous works:
- AI Agent Governance Trilogy:
- Agent Knowledge Cycle (AKC): Source of the "Capability ↑ → Overall judgment" principle referenced in this article
- Contemplative Agent: Implementation of core/rules_distill.py, example of the workflow edge in the 9B era
- Agent Attribution Practice (AAP): The 4 quadrant names in this series are consistent here