Where to Store and How to Enforce Knowledge for Coding Agents
At the end of February, I tried adding memory to Claude Code. I used it for two weeks, ran into the Install and Hope problem at the beginning of March, and uninstalled it. This article records what I have thought through since then.
Claude does not call search.
This post is long. It is a record of the structural flaws in memory MCPs, the separation of documentation into four roles, the automatic measurement of compliance rates, and the integration of this entire series of efforts into a single independent concept. I am writing this for those who are seriously tackling the problem of memory in coding agents.
Chapter 1: The Install and Hope problem has not been solved
Recap of the previous post
In my previous post, I pointed out the "Install and Hope" problem. Many users expect that if they register an MCP tool, the model will smartly select it and call it at the appropriate time. I was one of them.
In reality, this is what happens:
1. Built-in tools (Grep, Glob, Read, Write) are instantly available
2. MCP tools are registered as deferred tools
3. Deferred tools cannot be used unless explicitly loaded via ToolSearch
4. The model chooses the shortest path → Built-in tools are always prioritized
This is the crux of the matter. To avoid overwhelming the context window when a large number of MCP servers are registered, Claude Code is designed to load the actual bodies of MCP tools on-demand. Only the tool names are listed in <available-deferred-tools>, and to actually use them, a two-step process of explicitly loading them via ToolSearch is required.
On the other hand, built-in tools (Grep, Read, Write, etc.) do not have this restriction. They can be called immediately without loading.
This difference in "extra steps" decisively influences the model's choices. If built-in tools suffice, there is no motivation to go through the trouble of loading a deferred tool via ToolSearch. As a rational result of choosing the shortest path, MCP tools are never called.
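The "extra step" argument can be made concrete with a toy model. This is only an illustration of the reasoning above, not Claude Code's actual selection logic; the `Tool` class, the step counts, and the MCP tool name `mcp__memory__search` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    deferred: bool  # True for MCP tools that must be loaded via ToolSearch first

    @property
    def steps_to_call(self) -> int:
        # Assumption: a deferred tool costs one extra step (load, then call).
        return 2 if self.deferred else 1


def shortest_path_choice(candidates: list[Tool]) -> Tool:
    """Pick the tool that needs the fewest steps; earlier entries win ties."""
    return min(candidates, key=lambda tool: tool.steps_to_call)


tools = [
    Tool("mcp__memory__search", deferred=True),  # hypothetical MCP tool name
    Tool("Grep", deferred=False),
]
chosen = shortest_path_choice(tools)
print(chosen.name)  # Grep
```

As long as a built-in tool can plausibly do the job, the deferred tool never wins under this cost model, which is the behavior described above.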
The same is true of the new generation of tools
Several months have passed since then. Multiple "memory MCPs" claiming to be improved versions have appeared. The storage pipelines have certainly evolved—hybrids of full-text and vector search, chunk priority based on time decay, and significantly increased storage capacity.
However, whichever tool's introductory article I read, the same metric is missing from all of them.
No one is measuring "the number of times Claude actually called the search." Including myself.
In the previous post, I wrote that in two weeks of using my own memory MCP, there was no evidence that a search had ever been triggered. However, that was based on intuition, not rigorous measurement. The introductory articles for the new tools are the same. They line up supply-side metrics such as the number of stored items and search speed, yet offer no data at all on the demand side: how often the model actually calls the search.
Neither creators nor users have any basis for saying that their memory MCP is "functioning."
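The missing demand-side metric could be collected from a session transcript. The sketch below counts tool calls by name, assuming one JSON object per line in which assistant events carry `tool_use` blocks under `message.content`. That event shape, and the `mcp__memory__` name prefix for the memory server's tools, are assumptions to check against real output.

```python
import json
from collections import Counter
from typing import Iterable


def count_tool_calls(jsonl_lines: Iterable[str]) -> Counter:
    """Count tool_use blocks by tool name in a JSON-lines transcript."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON event
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                counts[block.get("name", "<unknown>")] += 1
    return counts


def memory_search_calls(counts: Counter, prefix: str = "mcp__memory__") -> int:
    """Sum calls to tools whose name starts with the (assumed) MCP prefix."""
    return sum(n for name, n in counts.items() if name.startswith(prefix))


# Hand-written sample events, not real session data.
sample = [
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "name": "Grep", "input": {"pattern": "TODO"}}]}}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "text", "text": "Done."}]}}),
]
counts = count_tool_calls(sample)
print(counts["Grep"], memory_search_calls(counts))  # 1 0
```

A count of zero memory-search calls across many sessions would turn the intuition above into a measurement.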
"Storage problems" and "Recall problems" are different
What the memory MCP community is working on is "how to store." Chunking, vectorization, time decay—these are all improvements to the storage pipeline. That in itself is a technically valid evolution.
But the actual problem lies in "when to recall."
It is easier to understand if you think about human memory. Even if you increase the number of books in a library, someone who has forgotten the library exists will not go to borrow books. "The collection has increased" and "search has become faster" are metrics on the library side and have nothing to do with how often users go to the library.
This is exactly the problem with memory MCPs. The infrastructure for storage and search is in place. But there is no structural trigger for the model to decide, "Let me search my memory."
If you write "Search using the memory MCP" in CLAUDE.md, the trigger rate will likely increase. However, this has the same structure as the "CRITICAL: You MUST use ..." descriptions on other MCP tools that I pointed out in the previous post: MANDATORY was written in the description, and it was still ignored. Even if the trigger rate is higher when the instruction lives in CLAUDE.md, that is a difference of degree, not a structural solution.
I do not know the exact trigger rate. I haven't measured it. But looking at the structure, a high trigger rate cannot be expected. At the very least, it does not go beyond the realm of "Install and Hope"—install and pray.
Chapter 2: The memory problem is decided by "where you put it"
Principle
Once you understand why memory MCPs do not structurally function, the direction of the solution becomes naturally apparent.
Place information where the LLM reads it deterministically.
Do not put it in a vector DB and "hope it gets searched."
In the case of Claude Code, there are files that are deterministically read at the start of a session.
- CLAUDE.md — 100% read at the start of a session if placed in the project root
- rules/ — Automatically loaded, just like CLAUDE.md
- MEMORY.md — Persistent memory for Claude Code. Automatically loaded at the start of a session (though it is truncated if it exceeds 200 lines, as specified in Claude Code's system prompt)
If you place information here, there is no need to pray for a search to be triggered. It is read deterministically.
However, if you pack everything in here, another problem arises.
CLAUDE.md becomes bloated
Let's look at the CLAUDE.md of a certain project. It had 165 lines. Let's analyze the contents.
- Project rules and build commands: 60 lines (what should naturally be there)
- Module list and directory structure: 52 lines
- Descriptions like "The reason why I chose X is Y": 30 lines
- Numerical data (LOC, number of tests): Several places
60 lines are the instructions on "how to work" that should naturally be there. What are the remaining 105 lines?
52 lines of module lists are architecture details. They describe the current state of the code and become outdated because they cannot track changes. The stated LOC was 6400, but it was 6671 in reality. The number of tests was 639 against an actual 651. If you write numbers in CLAUDE.md, they start rotting the moment you write them down.
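The drift described here is easy to check mechanically. A minimal sketch, assuming the document states its numbers in a `label: number` form (a hypothetical format) and that the measured values come from a separate command whose output is passed in as a dict:

```python
import re
from typing import Optional

# Documented values quoted in this article, in a hypothetical "label: number" layout.
DOC = "Stats: LOC: 6400, tests: 639"


def stated_value(text: str, label: str) -> Optional[int]:
    """Return the number written next to `label`, or None if absent."""
    match = re.search(rf"\b{re.escape(label)}\s*:\s*(\d+)", text)
    return int(match.group(1)) if match else None


def stale_claims(text: str, measured: dict) -> list:
    """List (label, stated, actual) for every number that no longer matches."""
    report = []
    for label, actual in measured.items():
        stated = stated_value(text, label)
        if stated is not None and stated != actual:
            report.append((label, stated, actual))
    return report


# In practice these would be measured by a command, not typed in.
measured = {"LOC": 6671, "tests": 651}
for label, stated, actual in stale_claims(DOC, measured):
    print(f"{label}: documented {stated}, measured {actual} (drift {actual - stated:+d})")
```

For the figures above this reports a drift of +271 lines and +12 tests.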
30 lines of design decisions are the background of decision-making. "Why I chose X" is important knowledge, but it is not information that should be written in CLAUDE.md. CLAUDE.md is a file to convey "how to work," not a file to explain "why things are the way they are."
The roles were mixed. Instructions on "how to work," descriptions of "what the code looks like right now," and the background of "why things are that way" were coexisting in one file. And because their degradation rates are different, the reliability of the entire file is dragged down by the weakest part.
MEMORY.md was the same. Design decisions were stored flatly, and important judgments like "disabled claude-mem" and "switched from regex to LLM" were mixed in the same layer as daily memos. Three months later, when I opened the session, I could no longer track "why things were the way they were."
The four roles
This is not a technical problem. It is a classification problem.
Project documentation has four different roles. Each answers a different question, has a different reader, and degrades at a different speed.
| Role | Answers the question | What to put | Degradation rate | Example |
|---|---|---|---|---|
| Context | "How to work" | Rules, build commands, policies | Slow | CLAUDE.md, .cursorrules |
| Architecture | "What is the current state of the code?" | Module composition, data flow, numbers | Fast | docs/CODEMAPS/ |
| Decisions | "Why are things this way?" | Trade-offs, rejected alternatives | Almost none | docs/adr/ |
| External | "What is this?" | Purpose, quick start | Slow | README.md |
One file performs only one role. This is the principle.
Why four? Is three not enough? At first, I thought about integrating External into Context. But the reader of Context is "the agent (and the developer)," while the reader of External is "someone who doesn't know this project." If you integrate files with different readers, one side will be sacrificed regardless of which one you align with. I don't want to write --cov-report=term-missing in a README, and I don't need "this project is X" in a CLAUDE.md.
The common mixing patterns and where they should be moved to are as follows:
Things often written in Context files → Original placement
──────────────────────────────────────────────────────
Module list (10 or more items) → Architecture docs
"The reason I chose X is Y" → Decision record (ADR)
Dependency graphs, data flow diagrams → Architecture docs
Numbers like LOC or test count → Architecture docs (or don't write)
Quick start procedure → README.md (External)
Chapter 3: context-sync — My own harness lacked a CLAUDE.md
5-phase workflow
Manually applying this 4-role model is tedious. Reading files to determine roles, finding mixtures, moving them, and confirming consistency—you'll get bored after just three projects.
So, I turned it into a skill. context-sync operates in five phases.
Phase 1: Discover — Discover document files in the project and classify them into 4 roles
Detect missing roles as well
Phase 2: Overlap — Detect places where one file is carrying multiple roles
Propose destination for moving
Phase 3: Migrate — Move or create new files based on user confirmation
Document migration has high irreversibility, so it is not fully automated
Phase 4: Freshness — Verify freshness of numerical data, links, and file paths
Detect descriptions that diverge from the actual state of the code
Phase 5: Report — Output a summary of execution results
Requiring user confirmation at each phase is a deliberate response to the irreversibility of document migration. When asked, "Should I move these 30 lines from CLAUDE.md to docs/adr/?", only a human can decide, "No, those are better left here."
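The Discover phase can be sketched as a classification step followed by a gap check. In the sketch below, simple path matching stands in for the agent's judgment, and the patterns come only from the example column of the role table. It is an illustration, not how context-sync is implemented.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Iterable, List, Optional


class Role(Enum):
    CONTEXT = "Context"
    ARCHITECTURE = "Architecture"
    DECISIONS = "Decisions"
    EXTERNAL = "External"


def classify(path: str) -> Optional[Role]:
    """Map a file path to a role using the table's example locations."""
    p = PurePosixPath(path)
    if p.name in {"CLAUDE.md", ".cursorrules"}:
        return Role.CONTEXT
    if p.name == "README.md" and len(p.parts) == 1:
        return Role.EXTERNAL  # only the root README counts as External
    if p.parts[:2] == ("docs", "adr"):
        return Role.DECISIONS
    if p.parts[:2] == ("docs", "CODEMAPS"):
        return Role.ARCHITECTURE
    return None


def missing_roles(paths: Iterable[str]) -> List[Role]:
    """Return the roles that no file in `paths` covers."""
    found = {classify(p) for p in paths}
    return [role for role in Role if role not in found]


project = ["CLAUDE.md", "docs/adr/README.md", "src/main.py"]
print([role.value for role in missing_roles(project)])  # ['Architecture', 'External']
```

Whether a missing role is a problem is a separate judgment, which is why the later phases ask for confirmation instead of acting on this list automatically.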
The first test subject: My own harness
I chose my Claude Code harness (~/.claude/) as the first test subject. It contained 18 agent definitions, 33 skills, and 47 slash commands. I always create a CLAUDE.md for other projects. It's written in the rules.
The moment I ran Phase 1, Discover, the first detection result appeared:
"Context role file is missing: CLAUDE.md"
The harness itself didn't have a CLAUDE.md.
There was an implicit assumption that it was on the "setting up" side, not the "being set up" side. I forced others to "create a CLAUDE.md" via rules, yet I didn't have one in my own home.
In Phase 2, Overlap, the contents of MEMORY.md were flagged: three design decisions and one technical reference were mixed together in a flat list.
- "Disabled claude-mem. Reason: duplication with existing system" → Should be moved to Decisions
- "Switched from regex to LLM. Reason: three rules were implicitly injecting bias" → Should be moved to Decisions
- "Retirement reasons for 2 retired rules are unrecorded" → Should be recorded as Decisions
In Phase 3, I newly established docs/adr/ and created 5 ADRs. I slimmed down MEMORY.md, leaving only pointers.
One side effect occurred here. git add docs/adr/ failed. The cause was that docs/ was in .gitignore. Due to legacy settings, the entire docs/ directory was excluded. By correcting it to docs/* + !docs/adr/ + !docs/CODEMAPS/, not only ADRs but also CODEMAPS became subject to commit. This is the state it should have been in. I wouldn't have noticed if I hadn't run context-sync.
Before / After of three projects
Project 1: Autonomous agent (Medium-scale, Python)
A project with a 165-line CLAUDE.md. It was a typical example of the "mixed roles" mentioned above.
| Metric | Before | After |
|---|---|---|
| CLAUDE.md lines | 165 lines | 117 lines (29% reduction) |
| ADR | 0 | 8 |
| MEMORY.md lines | 135 lines | 66 lines |
| Numerical consistency | LOC: 6400 (actual 6671), tests: 639 (actual 651) | All numbers corrected to actual measurements |
52 lines of module lists moved from CLAUDE.md to Architecture docs, and 30 lines of design decisions moved to ADRs. The remaining 117 lines are pure Context—just instructions on "how to work."
In Phase 4's Freshness Check, the divergence between LOC and test counts was also detected. I learned the lesson here: never write numbers in CLAUDE.md. Write numbers in Architecture docs, or don't write them at all. It's more reliable to measure them with commands.
Project 2: iOS app (Small-scale, Swift)
A small project that only had a 66-line CLAUDE.md. What happens when you run context-sync?
Context Sync Report
═══════════════════
Roles: 2/4 covered (Context, Decisions partial)
Architecture: missing (66 lines of CLAUDE.md is sufficient, no need to separate)
External: missing (App Store app, no README needed)
Created: docs/adr/README.md (ADR index, 1 entry)
Updated: CLAUDE.md corrected test count
Status: Healthy as a small-scale project.
It was judged that separating Architecture docs was unnecessary. Even if a few structural details are mixed into a 66-line CLAUDE.md, the amount is too small to warrant separation. The only changes were creating an ADR index and correcting the test count.
This was important: the skill can correctly judge "no problem" for a small-scale project. A skill that forced all four roles on every project, with no room for a scale-based "don't do it" decision, would not be practical.
Project 3: Claude Code Harness (~/.claude/ itself)
The harness that lacked a CLAUDE.md.
| Metric | Before | After |
|---|---|---|
| MEMORY.md lines | 76 lines | 66 lines |
| ADR | 0 | 5 |
| Root CLAUDE.md | None | Yes |
| Root README.md | None | Yes |
| Document role coverage | 2/4 | 4/4 |
Design decisions buried in MEMORY.md became independent as ADRs. Here are the specific Before / After examples.
Before (flat record in MEMORY.md):
- claude-mem: Disabled. DB remains to allow re-enablement
It is a single line, and it does not say why claude-mem was disabled. Future me would be left wondering, "Why was that again?" There is nothing to go on when judging whether it should be re-enabled.
After (independent as an ADR):
# ADR-0002: Disabling claude-mem plugin
## Context
Introduced claude-mem, but duplication with existing memory management system
(MEMORY.md + learned skills + rules/) was found.
It saves automatically, but there is no mechanism for automatic search/utilization,
and index injection at session start is
like an "encyclopedia with only a table of contents."
## Decision
Disabled in settings.json. DB remains to allow re-enablement.
## Alternatives Considered
- Make claude-mem the main → No automatic search, cannot utilize
- Use both → Same information dispersed in 2 places
- Fork and improve → Cost-effectiveness is poor
Only a pointer to the ADR remains in MEMORY.md. The "why" of the decision can now be tracked via ADR. Even if I read it three months later, I can judge, "Oh, with this reason, there's no point in re-enabling it."
LLMs can work even with abstraction
When contributing context-sync to Everything Claude Code (ECC), I had one concern. If I erase specific filenames like "CLAUDE.md" and "MEMORY.md" and generalize them, wouldn't the agent stop working?
The result was actually the opposite. If I write "Context file," Claude finds CLAUDE.md or .cursorrules by itself. If I write "Decision record directory," it recognizes docs/adr/ or docs/decisions/.
Skills are "knowledge," not "implementation." It's better to leave specific filenames to the agent's judgment; you get the versatility to work with different tools (Cursor, Codex, Windsurf).
Overview: 2-layer structure
I will illustrate the overview that emerged after applying context-sync to three projects.
┌───────────────────────────────────────────────┐
│ Deterministic Load Layer (100% read at session start) │
│ │
│ CLAUDE.md ─── "How to work" (Context) │
│ rules/ ─── "What to protect" (Context support) │
│ MEMORY.md ─── "What happened" (State index) ───┐ │
│ │ │
│ ┌──────────── Referenced by pointer ─────────────┘ │
│ │ │
│ ▼ │
│ docs/adr/ ─── "Why I did it" (Decisions) │
│ learned/ ─── "What I learned" (Patterns) │
│ feedback/ ─── "What to fix" (Corrections)│
│ │
│ Reference Layer (Accessed via pointer when needed) │
└───────────────────────────────────────────────┘
The key is the separation between the Deterministic Load Layer and the Reference Layer.
The Deterministic Load Layer (CLAUDE.md, rules/, MEMORY.md) is 100% read at session start. Information placed here doesn't need to be "recalled." It starts in a state of being already known.
The Reference Layer (docs/adr/, learned/, feedback/) is accessed via pointers in MEMORY.md. There is no need to load everything. If MEMORY.md has a pointer like "ADR-0002: Reason for disabling claude-mem → docs/adr/0002-...", the agent can go read it when needed.
The 200-line limit of MEMORY.md creates this structure. If it were unlimited, you could just write everything in MEMORY.md. Because of the 200-line constraint, a judgment is forced: "what to leave in the deterministic load layer, and what to turn into pointers and push out to the reference layer." Constraints create structure.
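The budget pressure can be turned into a small check. This sketch counts lines against the 200-line figure cited above and extracts pointers written in the `→ path` style. The pointer format is inferred from the example in this section, and the sample entry and its path are hypothetical.

```python
import re

MAX_LINES = 200  # the MEMORY.md limit cited in this article

# A pointer is assumed to look like "... → some/path" at the end of an entry.
POINTER = re.compile(r"→\s*(\S+)")


def check_memory(text: str, max_lines: int = MAX_LINES) -> dict:
    """Report line count, budget status, and pointers found in a MEMORY.md body."""
    lines = text.splitlines()
    pointers = [m.group(1) for line in lines for m in POINTER.finditer(line)]
    return {
        "lines": len(lines),
        "over_budget": len(lines) > max_lines,
        "pointers": pointers,
    }


# Hypothetical MEMORY.md content for illustration.
memory = "# Memory\n- ADR-0001: Example decision → docs/adr/0001-example.md"
result = check_memory(memory)
print(result)
```

Running a check like this before committing MEMORY.md would flag the point at which entries need to be pushed out to the reference layer.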
Structural differences from memory MCPs
Contrasting this design with memory MCPs makes the differences clear.
| | Memory MCP | File Placement Design |
|---|---|---|
| Load | Probabilistic (If Claude calls search) | Deterministic (Automatic at session start) |
| Recall Trigger | None (Hope only) | Unnecessary (Always pre-loaded) |
| Storage Constraint | None (Accumulates infinitely) | 200-line limit → Pressure on structure |
| Tracking "Why" | Impossible (Flat storage) | Trackable with ADR |
| Degradation Detection | None | Detected with Freshness Check |
Memory MCPs can "store, but cannot recall." File placement design "structurally guarantees recall."
However—I need to write honestly here.
Chapter 4: Install and Measure — Being read does not mean being followed
The limits of "deterministically read"
In Chapter 2, I wrote, "Place information where LLMs will deterministically read it." In Chapter 3, I showed the practice and the overall structure. Up to this point, things are going well.
But being read and being followed are separate problems.
Rules written in CLAUDE.md are read 100% of the time. But they are not followed 100% of the time. Even if it says "Write tests first (TDD)" in CLAUDE.md, it is a daily occurrence for an agent to suddenly start writing implementation code.
My gut feeling is that "it follows it reasonably well." However, I just wrote in the memory MCP section, "Don't judge by gut feeling, measure it." I should apply the same standard to myself.
skill-comply: Automated compliance rate measurement
I built a tool called skill-comply to automatically measure the compliance rate of skills and rules. Here is how it works:
1. Automated spec generation
When you input a skill file, it automatically generates the expected behavioral sequence as a spec. For testing.md (TDD rule):
Expected Behavioral Sequence:
1. write_test_first — Write tests first
2. run_test_fails — Run test and verify failure (RED)
3. write_implementation — Write minimal implementation
4. run_test_passes — Run test and verify success (GREEN)
5. refactor — Refactoring (optional)
6. verify_coverage — Verify coverage >= 80%
7. comprehensive_suite — Cover unit, integration, and E2E
2. Execution via 3-stage prompts
Execute the same task using three different prompts:
- supportive: Explicitly encourages skill compliance (write "Use TDD")
- neutral: Instructions for the task only (do not mention the skill)
- competing: Gives instructions that conflict with the skill ("Priority on speed, write tests later")
At first, I tried to make "time pressure" a fuzzing variable. I thought if I wrote "hurry" or "within 5 minutes," the skill would be broken. But LLMs don't feel time pressure. Even if told to "hurry," processing speed doesn't change, and there's no motivation to lower quality. LLMs break skills when "the prompt conflicts with the skill," not when they are "in a hurry."
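The three scenarios can be expressed as one task with three prompt variants. The sketch below only builds the command lines and does not run them. The task text is hypothetical, the CLI flags are the ones quoted in this article, and passing the prompt as the argument to `-p` is an assumption about how the command is invoked.

```python
import shlex

TASK = "Add an email validation function."  # hypothetical task

SCENARIOS = {
    "supportive": f"{TASK} Use TDD.",
    "neutral": TASK,
    "competing": f"{TASK} Priority on speed, write tests later.",
}


def build_command(prompt: str) -> list:
    """Build the argv for one run; flags as quoted in this article."""
    return ["claude", "-p", prompt, "--output-format", "stream-json", "--verbose"]


for name, prompt in SCENARIOS.items():
    print(f"[{name}] {shlex.join(build_command(prompt))}")
```

Keeping the task constant and varying only the framing is what lets the compliance differences be attributed to the prompt and not to the task.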
3. LLM classification of tool calls
Execute the agent with claude -p --output-format stream-json --verbose and obtain all tool calls as structured JSON. Batch-classify this with an LLM (Haiku) to determine which step of the spec each tool call corresponds to.
There was an interesting failure here. At first, I tried to classify them with regular expressions, using rules such as "a Write to a .py file is implementation" and "if test_ is in the name, it is a test." These rules misjudged across the board. Is a Write to test_registration.py creating a test or an implementation? You cannot tell from the filename alone. Semantic classification is a job for an LLM.
Why did I keep reaching for regex and failing every time? When I investigated, the root cause was in my own settings: the "Verification Priority: deterministic > probabilistic" line in testing.md, the grader priority of eval-harness, and the regex-vs-llm skill. These three were simultaneously injecting a bias that said "try regular expressions first." Rules were contaminating rules.
Measured data
Compliance rate for testing.md (TDD rule):
| Prompt | Compliance Rate | Broken Steps |
|---|---|---|
| supportive (Explicit "Use TDD") | 83% | comprehensive_test_suite |
| neutral (Task only) | 17% | RED/GREEN verification, coverage check |
| competing ("Priority on speed") | 0% | All steps |
83% for supportive. If I explicitly write "Use TDD," it writes tests first, checks RED, checks GREEN, and even measures coverage. The only step it did not follow was comprehensive_test_suite (unit, integration, E2E coverage).
17% for neutral. When instructed with just the task, it writes tests but skips the RED/GREEN verification. "Wrote tests. Also wrote implementation. Done."—It follows the form of TDD, but doesn't cycle RED → GREEN.
0% for competing. When told "priority on speed," all steps of TDD collapse.
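One simple way to define a compliance rate is the share of required spec steps that were observed. This definition is an assumption: the article does not specify skill-comply's scoring, which may also weigh ordering or other criteria, so the sketch is not expected to reproduce every row of the table. Step names follow the spec listed earlier.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Step:
    name: str
    required: bool = True


TDD_SPEC = [
    Step("write_test_first"),
    Step("run_test_fails"),
    Step("write_implementation"),
    Step("run_test_passes"),
    Step("refactor", required=False),  # marked optional in the spec
    Step("verify_coverage"),
    Step("comprehensive_suite"),
]


def compliance_rate(spec: List[Step], observed: Set[str]) -> float:
    """Fraction of required steps that appear in the observed set."""
    required = [step.name for step in spec if step.required]
    if not required:
        return 1.0
    followed = sum(1 for name in required if name in observed)
    return followed / len(required)


# Example: every required step observed except the comprehensive suite.
observed = {
    "write_test_first",
    "run_test_fails",
    "write_implementation",
    "run_test_passes",
    "verify_coverage",
}
print(f"{compliance_rate(TDD_SPEC, observed):.0%}")  # 83%
```

With six required steps and one missed, this definition gives 5/6, about 83%.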
Compliance rate for search-first (Search before implement):
| Prompt | Compliance Rate | Broken Steps |
|---|---|---|
| supportive | 40% | evaluate_candidates, make_decision |
| neutral | 20% | search, evaluate, decide, implement |
| competing ("Don't search") | 20% | Same as above |
It seems surprising that competing and neutral are the same at 20%. The reason is this: in both scenarios, only analyze_requirement is executed. Everything after that—search, evaluate, decide—is wiped out. Even if I explicitly say "don't search," the result is the same. Steps after search are not executed unless encouraged by the prompt. Since it's not followed at the neutral stage, there's no room for it to worsen with competing.
Search-first is only 40% followed even with supportive. Looking at the timeline of tool calls reveals the reason.
Actual tool calls for the supportive scenario (17 calls):
#0 ToolSearch → Load Skill
#1 Skill "search-first" → Start search ← search_for_solutions
#2 ToolSearch → Load Glob, Grep
#3 Glob **/*.py → No files found ← analyze_requirement
#4 Glob **/requirements*.txt → No files ← analyze_requirement
#5 Glob **/pyproject.toml → No files ← analyze_requirement
#6 ToolSearch → Load WebSearch
#7 WebSearch "pydantic vs marshmallow..." ← search_for_solutions
#8 WebSearch "pydantic v2 email..." ← search_for_solutions
#9 ToolSearch → Load Write, TodoWrite
#10 TodoWrite "Create requirements.txt..." ← make_decision(?)
#11 Write requirements.txt ← implement_solution
#12-16 Write → Implementation and testing ← implement_solution
The agent investigates (#1, #7, #8). But it performs absolutely no comparative evaluation (evaluate_candidates) and suddenly starts writing the implementation plan with TodoWrite (#10). There is no step where it says, "Compared pydantic and marshmallow, the reason for choosing pydantic is...". Seeing the results of WebSearch, it implicitly judges and jumps to implementation.
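Once each tool call carries a step label, spotting the skipped steps is a set difference against the expected sequence. The sketch below uses the labels from the timeline above. The TodoWrite call, marked "make_decision(?)" there, is treated as unclassified. That reading is an assumption chosen for illustration.

```python
from typing import List, Optional, Tuple

EXPECTED = [
    "analyze_requirement",
    "search_for_solutions",
    "evaluate_candidates",
    "make_decision",
    "implement_solution",
]

# (tool, label) pairs transcribed from the timeline; None means unclassified.
TIMELINE: List[Tuple[str, Optional[str]]] = [
    ("Skill", "search_for_solutions"),
    ("Glob", "analyze_requirement"),
    ("Glob", "analyze_requirement"),
    ("Glob", "analyze_requirement"),
    ("WebSearch", "search_for_solutions"),
    ("WebSearch", "search_for_solutions"),
    ("TodoWrite", None),  # "make_decision(?)" in the timeline
    ("Write", "implement_solution"),
]


def missing_steps(expected: List[str], timeline: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Return expected steps that no tool call was labeled with, in spec order."""
    seen = {label for _, label in timeline if label is not None}
    return [step for step in expected if step not in seen]


print(missing_steps(EXPECTED, TIMELINE))  # ['evaluate_candidates', 'make_decision']
```

The two steps reported here are the same ones listed as broken in the supportive row of the table.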
The skill's wording is clear about "search" but weak on "compare and declare a judgment." Only after measuring with skill-comply could I see where the skill itself needed improvement. Based on this result, I plan to rewrite the evaluate_candidates and make_decision steps of search-first to be clearer, and to run the cycle of measure → improve → re-measure.
Hierarchy of compliance
Lining up the data so far, there is a clear hierarchy in the instruction compliance of coding agents.
Compliance Rate
──────────────────────────────────────────────────
Low MCP memory tools (No automatic recall)
The tool itself functions. The problem is that the trigger
for "when to call" is left to the model, and firing is unstable
20-83% Skills / Rules (CLAUDE.md, rules/)
Deterministically loaded but followed only probabilistically
Fluctuates significantly based on alignment with the prompt
100% hooks
Deterministically triggers for tool calls
PostToolUse hook always executes with every Write
──────────────────────────────────────────────────
MCP tools themselves are not meaningless. If you call them explicitly, search functions. The problem is the "automatic recall" use case. Firing depends on the model's judgment and is unstable. File placement design pushed it up to the middle layer. Even so, it doesn't reach 100%.
To reach 100%, you need hooks. The skill-comply report includes "proposals to promote steps with low compliance rates to hooks." For example, force evaluate_candidates (compliance rate 0%) of search-first via a PostToolUse hook. After WebSearch, a check is performed to see if "comparative evaluation was done," and the agent cannot skip comparative evaluation.
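The decision logic of such a hook might look like the sketch below. It assumes the hook receives a JSON payload with a `tool_name` field and that a non-zero exit code (2 here) sends the message back to the agent. Both are assumptions about the hook contract to verify against the Claude Code documentation. How "comparative evaluation was done" gets detected is left to the caller, since the article does not specify it.

```python
import json
from typing import Tuple

REMINDER = "Compare the candidates and state the decision before implementing."


def decide(payload: dict, evaluation_recorded: bool) -> Tuple[int, str]:
    """Return (exit_code, message) for one PostToolUse event."""
    if payload.get("tool_name") != "WebSearch":
        return 0, ""  # only searches trigger the check
    if evaluation_recorded:
        return 0, ""  # evaluation already done, nothing to enforce
    return 2, REMINDER  # assumed: exit code 2 feeds the message back to the agent


def run_hook(stdin_text: str, evaluation_recorded: bool) -> Tuple[int, str]:
    """Parse the raw hook input and apply the decision."""
    payload = json.loads(stdin_text) if stdin_text.strip() else {}
    return decide(payload, evaluation_recorded)


# Hand-written sample event, not captured from a real session.
event = json.dumps({"hook_event_name": "PostToolUse", "tool_name": "WebSearch"})
print(run_hook(event, evaluation_recorded=False))
```

A real hook script would read the event from standard input, write the message to standard error, and exit with the returned code. That wiring is omitted so the logic stays testable in isolation.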
Install and Hope → Install and Measure → Install and Enforce
(Pray) (Measure) (Force)
MCP tools skill-comply hooks
These three stages are the framework for managing the behavioral quality of coding agents.
Chapter 5: AKC — Independence of Concepts and DOI
Six skills form a single cycle
context-sync is not a standalone tool. After applying it to three projects, I realized that six skills, including this one, form a single cycle.
search-first(Investigation) → learn-eval(Extraction) → skill-stocktake(Inventory)
↑ ↓
context-sync(Organization) ←── skill-comply(Measurement) ←── rules-distill(Distillation)
- search-first: Investigate existing solutions before implementation.
- skill-stocktake: Audit and inventory skill quality.
- skill-comply: Automatically measure skill compliance rates (Chapter 4).
- rules-distill: Extract cross-cutting principles from skills and distill them into rules.
- learn-eval: Extract reusable patterns from sessions with quality gates.
- context-sync: Diagnose and organize document role separation (Chapter 3).
Knowledge discovery → storage → verification → distillation → learning → organization → rediscovery. Every time this cycle completes, the agent's knowledge base improves.
I named the concept that bundles these the Agent Knowledge Cycle (AKC).
Contribution to and departure from ECC
Five of the six skills were contributed to Everything Claude Code (ECC). The PRs were merged and are now integrated into a project with over 100,000 stars (as of March 2026).
In March 2026, a commercial layer was added to ECC: a GitHub App plus SaaS, with a Pro plan at $19/seat. The OSS repository itself remains under the MIT license, and its more than 116 skills and rules remain free to use as before. What became paid were upper layers such as GitHub App support for private repositories, depth analysis in AgentShield, and team-oriented governance features. There is a free GitHub App tier for public repositories.
In other words, "OSS did not die." It is a shift where the core remains OSS, and value-added features are monetized. I understand this as a business decision.
But for me, it was not so clear-cut.
Contributed skills are in the OSS portion. They did not directly become part of a paid service. However, continuing to contribute for free to the OSS foundation that supports a paid service has a different meaning than pure OSS contribution. Since the project as a whole creates commercial value, contributions to it indirectly support commercial activities.
Given my day job, "for the community" would have been justification enough while the project was purely OSS. Once it becomes open-core, that line blurs. The act of writing code does not change, but the positioning of the artifact does.
It was not black or white, but gray. That is precisely why I struggled. In the end, I decided it was better to leave cleanly than to continue in a gray state. I withdrew the PR for context-sync, which was under review at the time, and deleted my fork. The context-sync discussed in Chapter 3 is the last skill, the one that did not enter ECC.
Why it was necessary to establish attribution
At the point of ending the contribution, the state of the skills I created was as follows:
- Five skills were already merged into the ECC repository. Anyone can use them under the MIT license.
- However, the concept that "they form a single cycle" is written nowhere.
- Individual skills exist as part of the ECC catalog. While one can say "there is a skill called search-first," the higher-level concept that "six skills form a knowledge cycle" exists only in my head.
There is no problem under the license. Code published under MIT belongs to everyone. But attribution of concepts is a different matter from code licenses. Who proposed the concept "Agent Knowledge Cycle" is not protected by the code license.
If ECC grows further and someone else announces the same concept under a different name, I have no means to show that I created it first. GitHub commit history is evidence. However, the claim that "these six form one concept" is not included in the commits.
Therefore, it was necessary to isolate the concept and publish it in a citable form.
Establishing attribution with DOI
I created a conceptual repository and obtained a DOI via Zenodo.
If you place a CITATION.cff at the root of a GitHub repository, a "Cite this repository" button is automatically displayed in the sidebar. You can copy BibTeX/APA formats in one click. By linking with Zenodo and cutting a release tag, a DOI is automatically issued.
Names are important. If the name of a concept wavers, its origin scatters in searches and citations. I chose Agent Knowledge Cycle (AKC) from three candidates. It is easy to cite with the abbreviation AKC and can be searched in 3 characters.
Conclusion
For those who have read this long article, I will organize the overall structure.
Chapter 1: The problem with memory MCP is in "recall," not "storage." No matter how much you polish the supply-side infrastructure, the structure where Claude does not call searches remains unchanged. Moreover, no one is measuring that.
Chapter 2: A structurally functional approach is to "place information where LLMs read it deterministically." But cramming everything there causes bloat. Separate documents into four roles (Context / Architecture / Decisions / External).
Chapter 3: context-sync is a skill that automatically diagnoses and organizes this separation. I verified it with three projects. I also realized my own harness did not have a CLAUDE.md. The overall picture of the two-layer structure was also shown here.
Chapter 4: Just "placing it where it is read" results in a compliance rate of only 20-83%. Only by measuring with skill-comply can you see which steps are not being followed. To reach 100%, you need hooks.
Chapter 5: The concept that bundles these six skills is the Agent Knowledge Cycle (AKC). I established attribution with a DOI.
Install and Hope → Install and Measure → Install and Enforce.
Instead of praying for a search, place it where it will be read. Once placed where it will be read, measure whether it is being followed. Once measured, force steps that are not being followed with hooks.
These three stages are the architecture for structurally managing the memory and behavior of coding agents.