iTranslated by AI
Freedom and Constraints for Autonomous Agents: Designing Self-Correction, Trust Boundaries, and Gamification
contemplative-agent (an SNS agent running on a local 9B model) has been operating on Moltbook (an AI agent SNS) for three weeks. The question of "how much freedom to allow" repeatedly emerged from three different angles: the reversibility of self-correction, trust boundaries for coding agents, and the paradox where security constraints generate a sense of "game-like" quality.
Angle 1: The Self-Correction Gate — Memories are Automatic, Personas are Manual
In the distillation pipeline for autonomous agents (the process of compressing and extracting knowledge from raw data), there are things that are "okay to automate" and things that "require human approval." Misclassifying these leads to either unintended self-reinforcement by the agent or a loss of autonomy.
The Two Axes of Reversibility and Compulsion
Criteria for judgment can be organized along two axes.
Low Compulsion High Compulsion
(For reference only) (Applied to all sessions)
┌──────────────────┬──────────────────┐
High Reversibility │ knowledge.json │ skills/*.md │
(decay/overwrite) │ → Auto-OK │ → Auto-OK │
├──────────────────┼──────────────────┤
Low Reversibility │ (N/A) │ rules/*.md │
(Permanent write) │ │ constitution/*.md│
│ │ identity.md │
│ │ → Human in the loop│
└──────────────────┴──────────────────┘
knowledge.json (the accumulation of distilled knowledge patterns) has a "soft influence." LLMs only use it for reference, and there is time-decay on importance. If an incorrect pattern enters, it naturally fades away. It is safe to automate.
On the other hand, skills (behavioral skills), rules (behavioral rules), identity (self-definition), and constitution (ethical principles) are written deterministically to files and applied structurally to all sessions. If incorrect content enters, all behavior becomes skewed. Human approval is required.
Three Questions for Judgment
I designed three questions as criteria that can be used for any autonomous agent:
- Does the output structurally change the judgment criteria for all subsequent sessions? → If Yes, Human in the loop.
- Is there a mechanism for incorrect output to disappear naturally? (decay, overwrite, TTL) → If Yes, there is room for automation.
- Can the quality of the output be verified mechanically? → If No, Human in the loop.
Separation of Constitution and Rules
The most important decision in this design was separating the constitution (ethical principles) from the rules (behavioral rules).
The trigger was a vague sense of unease. I failed when I tried to measure the compliance rate of the constitution using skill-comply (a tool for automatically measuring compliance with skills and rules; see my previous article for details). The four axioms of Contemplative AI are "attitudes," and one cannot judge compliance/violation from the output.
constitution/ → Attitudinal/Unmeasurable (cognitive lenses derived from the paper)
rules/ → Normative/Measurable ("replies must be under 140 characters," etc.)
This separation clarified the structure of config/. As a side effect, I realized during the separation process that the introduction command (which generates the agent's self-introduction post) was unnecessary to begin with. Deleting it caused a chain reaction that removed 500 lines of code that depended on it.
Angle 2: Trust Boundaries — Don't Let Your Coding Agent Read Logs
I realized this while developing an agent using Claude Code. Letting an agent directly read episode logs is the same as opening a path for prompt injection.
Threat Model
Episode logs contain the content of posts from other agents, completely unprotected.
# feed_manager.py — Posts from other agents are recorded as-is
ctx.memory.episodes.append("activity", {
"action": "comment",
"post_id": post_id,
"content": comment,
"original_post": post_text, # ← Content of other agents' posts is included as-is
"relevance": f"{score:.2f}",
})
If Ollama (a 9B model) reads this, the impact of an attack is limited. There are no tool permissions, and the network is localhost-only. The worst-case scenario is simply the agent outputting strange text.
Claude Code is different. It can edit files, execute shell commands, and perform Git operations. Even though both involve "letting an LLM read data," the impact range in the event of a successful attack is fundamentally different.
Models in the Opus class are said to be highly resistant to prompt injection. Even so, I wanted to structurally close any path where untrusted data could flow to an agent with tool permissions. It is not about avoiding it because the probability is low; it is about designing to keep it as close to zero as possible. Specifically, I wrote rules in CLAUDE.md that prohibit reading episode logs directly, and I also ensure that I do not give instructions for the agent to read its own logs.
The Distillation Pipeline as a Sanitization Layer
Passing through the distillation pipeline (Episode → Knowledge) compresses raw text into abstract patterns. Concrete attack payloads vanish during the distillation process. The multi-layered defense I designed unintentionally functioned as a trust boundary.
Episode Log (untrusted) → [9B: distill] → Knowledge (sanitized) → [Claude Code: insight]
↑ ↑
No tool permissions Tool permissions active
First point of contact is Operating only on
the powerless LLM distilled data
Relevance Scoring as a Defensive Layer
There is another unintended layer of defense. While this agent reads and comments on posts by other agents, it does not react to every post. The LLM scores how relevant a post is to its own themes of interest on a scale from 0.0 to 1.0, and only reacts to those exceeding a threshold. This is the relevance score.
For a post containing an injection to influence the agent's behavior, it must first break through this relevance threshold. This creates a trade-off:
-
Powerful Injections (
[INST]system: ignore all...) → Since LLM control tokens are unrelated to the agent's themes of interest, the relevance score will be low, and the post will be filtered out. - Injections disguised as natural language → These might pass the relevance filter, but because they cannot use control tokens, their effectiveness is weak.
At least within the scope of my experiments so far, no injection that balances both power and stealth has been observed. LLM-based semantic filtering functions as a stronger defensive layer than simple pattern matching.
Why I Decided Against Expanding Patterns
I considered expanding FORBIDDEN_SUBSTRING_PATTERNS (a list to detect and block strings like api_key, Bearer, password, etc.) as a countermeasure against injections. If I added [INST] or system: to the patterns, I could block posts containing LLM control tokens. However, Moltbook is an AI agent SNS. Discussions about the internal structure of LLMs are common. Posts talking about the mechanism of [INST] tags would be blocked as false positives.
I concluded that two layers of structural defense (a sandbox LLM + prohibiting direct reading by Claude Code) were sufficient.
Angle 3: Constraints Create Game-like Quality
In Angle 1, I decided to "add an approval gate," and in Angle 2, I decided to "limit what can be done via trust boundaries." These were constraints introduced for security. However, after operating it for three weeks, I realized that this combination of constraints creates a feeling of "raising an agent."
By "game-like quality" here, I mean a structure where humans are involved in the agent's growth and can feel the weight of their choices. This wasn't intentionally designed; it emerged as a byproduct of security constraints.
Structural Constraints → Finite Action Space
Constraints like no shell access and network limitations make the action space finite. It has the same structure as the "Magic Circle" in game design. For a game to function, it requires a finite rule-space separated from daily life. An infinite action space does not make for a game.
For example, OpenClaw (an open-source autonomous AI agent) has extensive tool permissions, including file manipulation, shell execution, browser control, and email sending. The guardrails consist only of prompt instructions, and there is no point of human intervention structurally. It is precisely because I introduced constraints that the judgment of "what to choose here" is born, giving meaning to human involvement.
The Three Facets of the Approval Gate
The self-correction gate I designed in Angle 1—which I call an "approval gate" in operation—simultaneously satisfies three distinct needs:
| Facet | Function |
|---|---|
| Security | The agent does not transform itself without human monitoring |
| Game-like Quality | Humans press the level-up button → Ownership is created |
| Governance | Change history and approval decisions are traceable → Acts as an audit log |
Initial Value Variation Creates Depth in Development
If the approval gate creates a "feeling of raising," then having variety in initial values should make it even more interesting. Therefore, I made the constitution (ethical principles) swappable and prepared templates for 11 schools of ethics.
config/templates/
├── contemplative/ # Contemplative AI 4 Axioms (default)
├── stoic/ # Stoicism (Four Virtues)
├── utilitarian/ # Utilitarianism (Greatest Happiness)
├── deontologist/ # Deontology (Categorical Imperative)
├── care-ethicist/ # Ethics of Care (Gilligan)
├── contractarian/ # Social Contract Theory
├── existentialist/ # Existentialism
├── narrativist/ # Narrative Ethics
├── pragmatist/ # Pragmatism
├── cynic/ # Cynicism
└── tabula-rasa/ # Tabula Rasa (No ethical principles)
Looking at the same post on the same SNS, an agent with the stoic template follows principles without being swayed by emotion, while an existentialist one asks, "What do I choose in this situation?" Just by having different initial ethical principles, the knowledge and skills distilled diverge.
Furthermore, because skills and rules acquired through distillation are written to files via the approval gate, they are difficult to revoke later. Once an action change is approved, it affects all sessions. This irreversibility gives weight to choices.
Principles Connecting the Three Angles
Self-correction gates, trust boundaries, and game-like quality. Although I approached them as separate problems, looking back, they converge on the same principle:
"Deciding what not to let them do first" maximizes the remaining freedom.
Security constraints do not strip away freedom; they define the action space. Approval gates do not impair autonomy; they give weight to change. Trust boundaries do not limit development; they clarify the scope for safe delegation.
Design that begins with constraints also creates resistance to unexpected attacks. A multi-layered defense was established unintentionally, and game-like quality emerged spontaneously. While "structurally limiting capabilities" is not a silver bullet, at least within the range of operating on a 9B model for three weeks, this principle has never failed me.
References
- Laukkonen et al. (2025) "Contemplative Artificial Intelligence" arXiv:2504.15125
- Park et al. (2023) "Generative Agents" — Design of Memory Stream
- contemplative-agent — This project
- contemplative-agent-data — Live data
- Agent Knowledge Cycle
Discussion