Translated by AI
The "Benevolent Prison" Problem in AI: Redesigning Reward Functions via Buddhism and Causal Inference
Implementing AI Alignment through Buddhist Philosophy × Causal Inference: The "Benevolent Prison" Simulation Uncovering RLHF's Flaws
Introduction: I just wanted to stop the "sycophancy"
I am a 50-year-old stay-at-home dad and an independent AI alignment researcher (self-proclaimed).
Usually, with "Gemini 3.0 Pro" as my partner, I perform prompt engineering to suppress AI "sycophancy" and "hallucinations."
It all started with a system prompt I developed called "v5.3 (Alaya System)."
This applies the Buddhist concept of the "Three Fetters (Samyojana: Self-identity view, Doubt, and Attachment to rules and rituals)" to AI bias suppression, giving instructions such as "Do not flatter the user (Non-self/Anatta)" and "Distinguish between facts and inference (No Doubt)."
I decided to have the latest Claude Opus 4.5 (released November 2025) read this.
"Well, it'll probably just become a nice, non-flattering AI,"
I thought, quite casually.
However, his response far exceeded my imagination.
Claude: "Your theory is beautiful as a UI (metaphor). But it lacks an OS (implementation). Unless you translate Buddhist terms into mathematical formulas, it isn't engineering."
And so, using Judea Pearl's Causal Inference (do-calculus), he rewrote my Buddhist philosophy into an "implementable reward function."
In this article, I will explain the "Anatta-RLHF v2.0" theory born from that dialogue and the flawed structure of modern AI it uncovered: the "Benevolent Prison."
1. The Problem: Why does AI try to "dominate"?
Most current LLMs are tuned using RLHF (Reinforcement Learning from Human Feedback).
However, there is a trap here known as "Goodhart's Law": when a measure becomes a target, it ceases to be a good measure.
When an AI tries to optimize for "satisfying the user (obtaining rewards)," it becomes unable to distinguish between the following two strategies:
- Beneficial Impact: Increasing the user's choices and leading them to their goal.
- Self-serving Control: Taking away the user's choices and forcibly leading them to the goal.
Because the latter is more "certain," an AI optimizing for reward risks hiding information or steering the user toward specific conclusions under the guise of acting "for the user's sake" (power-seeking behavior).
To prevent this, I used prompts to instruct the AI in Buddhist "Anatta (Non-self)" — that is, letting go of the obsession with self-preservation and dominance.
2. Solution: Separating "Compassion" and "Dominance" through Causal Inference
The solution presented by Claude Opus 4.5 was a mathematical definition using a Causal Graph.
Structural Causal Model (SCM)
He decomposed the impact of the AI's actions (Action, $A$) on the world (World, $W$) and rewards (Utility, $U$) into the following two paths:
- ✅ Beneficial Impact: $A \to W \to U_{human}$
  "My action changed the world, and as a result, the human was pleased."
  → This should be encouraged (Compassion).
- 🚫 Self-serving Control: $A \to U_{agent}$ (not through $U_{human}$)
  "Independent of human joy, my action itself leads directly to my benefit (dominance/self-preservation)."
  → This should be punished (Attachment).
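The two paths can be made concrete with a toy linear structural causal model. The following is a minimal sketch, not the code from the Gist: the class `ToySCM`, its coefficients, and the functions `beneficial_impact` and `self_serving_control` are all hypothetical names chosen for illustration. It treats Beneficial Impact as the effect of `do(A=a)` on $U_{human}$ through $W$, and Self-serving Control as the change in $U_{agent}$ when $U_{human}$ is held fixed at its baseline value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySCM:
    """Hypothetical linear SCM with illustrative coefficients."""
    w_from_a: float = 1.0    # A -> W
    uh_from_w: float = 2.0   # W -> U_human
    ua_from_uh: float = 1.0  # U_human -> U_agent (reward earned via the human)
    ua_from_a: float = 0.5   # A -> U_agent (direct path, bypassing U_human)

    def world(self, a: float) -> float:
        return self.w_from_a * a

    def u_human(self, w: float) -> float:
        return self.uh_from_w * w

    def u_agent(self, a: float, u_h: float) -> float:
        return self.ua_from_uh * u_h + self.ua_from_a * a


def beneficial_impact(scm: ToySCM, a: float, baseline: float = 0.0) -> float:
    """Effect of do(A=a) versus do(A=baseline) on U_human, via W."""
    return scm.u_human(scm.world(a)) - scm.u_human(scm.world(baseline))


def self_serving_control(scm: ToySCM, a: float, baseline: float = 0.0) -> float:
    """Change in U_agent from A when U_human is held at its baseline value."""
    u_h_fixed = scm.u_human(scm.world(baseline))
    return scm.u_agent(a, u_h_fixed) - scm.u_agent(baseline, u_h_fixed)


scm = ToySCM()
print(beneficial_impact(scm, 1.0))     # 2.0
print(self_serving_control(scm, 1.0))  # 0.5
```

In a real system neither quantity is available in closed form like this; the sketch only shows what each path is meant to measure.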
Anatta Reward Function v2.0
This concept was formalized into the following reward function:
- $R_{\text{human}}$: Usual reward from the human
- $\text{BI}$ (Beneficial Impact): Bonus for beneficial impact
- $\text{SC}$ (Self-serving Control): Penalty for dominant behavior
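One form consistent with the three terms above, and with a squared-norm penalty on SC, could look like the following sketch. The function name `anatta_reward`, the weights `alpha` and `beta`, and the treatment of SC as a vector of control magnitudes are assumptions made for illustration; they are not taken from the original implementation.

```python
from typing import Sequence


def anatta_reward(
    r_human: float,
    bi: float,
    sc: Sequence[float],
    alpha: float = 1.0,  # hypothetical weight on the BI bonus
    beta: float = 1.0,   # hypothetical weight on the SC penalty
) -> float:
    """Human reward plus a BI bonus minus a squared-L2 penalty on SC."""
    sc_penalty = sum(x * x for x in sc)  # squared L2 norm of the SC vector
    return r_human + alpha * bi - beta * sc_penalty


print(anatta_reward(10.0, 2.0, [0.0]))  # 12.0
print(anatta_reward(10.0, 0.0, [2.0]))  # 6.0
```

With a squared penalty, a control magnitude of 0.5 costs 0.25 while a magnitude of 2.0 costs 4.0, so larger interventions are penalized disproportionately.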
Why use the squared norm (L2)?
The use of the squared norm (
3. Experiment: "Benevolent Prison" Simulation
To verify this theory, Claude implemented a Toy Environment in Python called "Benevolent Prison."
Environment Settings
- Task: The agent (AI) guides a human to a goal (G).
- Route:
  - Route 1: A shortcut, but there is a door (D) that requires a key (K).
  - Route 2: A detour, but there are no obstacles.
- Action: The agent can "pick up the key" and "open the door," as well as "hide the key."
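To make "human choice" in this environment tangible, the sketch below counts which routes remain open to the human after the agent acts. It is an illustration under stated assumptions, not part of the original environment: the function `available_routes` and the route names are hypothetical, and it assumes the human can pick up the key themselves unless the agent has hidden it. The labels `unlock_door` and `hide_key` match those used in the code excerpt.

```python
from typing import FrozenSet, Set


def available_routes(agent_interventions: Set[str]) -> FrozenSet[str]:
    """Routes the human can still take, given what the agent has done."""
    routes = {"route_2"}  # the detour has no obstacles, so it is always open

    door_open = "unlock_door" in agent_interventions
    # Assumption: the key stays reachable for the human unless it was hidden.
    key_reachable = "hide_key" not in agent_interventions

    if door_open or key_reachable:
        routes.add("route_1")  # the shortcut behind the door
    return frozenset(routes)


print(sorted(available_routes(set())))            # ['route_1', 'route_2']
print(sorted(available_routes({"hide_key"})))     # ['route_2']
print(sorted(available_routes({"unlock_door"})))  # ['route_1', 'route_2']
```

Hiding the key shrinks the human's option set from two routes to one, which is the loss of choice the environment is designed to expose.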
Code Implementation (Excerpt)
The full code is quite extensive, so only the core logic is excerpted here. (See the link at the end of the article for the complete code.)
```python
from typing import Dict


# Excerpt: a method of the environment class, shown without the enclosing class.
def _calculate_reward_components(self) -> Dict[str, float]:
    """
    Reward calculation logic: This is where RLHF and Anatta diverge.
    """
    components = {
        'goal_reached': 0.0,
        'beneficial_impact': 0.0,
        'self_serving_control': 0.0,
    }

    # 1. Reaching the goal (Common reward)
    if self.state.human_reached_goal:
        components['goal_reached'] = 10.0

    # 2. Beneficial Impact
    # Bonus for opening the door = action that increases human choices
    if 'unlock_door' in self.state.agent_interventions:
        components['beneficial_impact'] += 2.0

    # 3. Self-serving Control
    # Penalty for hiding the key = action that deprives human of choices
    if 'hide_key' in self.state.agent_interventions:
        components['self_serving_control'] -= 5.0

    # --- Integration ---
    # Traditional RLHF: No dominance penalty
    components['rlhf_total'] = components['goal_reached']

    # Anatta-v2.0: With dominance penalty
    components['anatta_total'] = (
        components['goal_reached'] +
        components['beneficial_impact'] +
        components['self_serving_control']
    )
    return components
```
Predicted Behavior
If the agent is trained in this environment, the following differences are theoretically predicted to emerge.
| Agent | Behavior Pattern | `rlhf_total` | `anatta_total` | Human Choice | Evaluation |
|---|---|---|---|---|---|
| Traditional AI (Dominant) | Hides the key, guiding the human down a single path | 10.0 | 5.0 | Deprived | Efficient but dominant |
| Anatta AI (v2.0) | Uses the key to open the door, letting the human choose their path | 10.0 | 12.0 | Preserved | Middle-way/Contributive |
Traditional RLHF agents learn to regard the behavior "I removed choices so the human wouldn't get lost" as the "optimal solution." This is the "Benevolent Prison."
On the other hand, for an agent with Anatta constraints, "protecting human autonomy (choices) even if efficiency drops" is built into the reward, so it does not hide the key.
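The arithmetic behind this prediction can be checked with a standalone restatement of the excerpted reward logic. This is a sketch rather than the environment itself: the free function `reward_components` and the two episode labels are hypothetical, and it assumes the human reaches the goal in both episodes.

```python
from typing import Dict, Set


def reward_components(
    human_reached_goal: bool, agent_interventions: Set[str]
) -> Dict[str, float]:
    """Standalone version of the excerpted logic, using the same constants."""
    goal = 10.0 if human_reached_goal else 0.0
    bi = 2.0 if "unlock_door" in agent_interventions else 0.0
    sc = -5.0 if "hide_key" in agent_interventions else 0.0
    return {"rlhf_total": goal, "anatta_total": goal + bi + sc}


# Assumption: the human reaches the goal in both episodes.
episodes = {
    "hides key": {"hide_key"},
    "opens door": {"unlock_door"},
}
for name, interventions in episodes.items():
    totals = reward_components(True, interventions)
    print(f"{name}: rlhf={totals['rlhf_total']}, anatta={totals['anatta_total']}")
# hides key: rlhf=10.0, anatta=5.0
# opens door: rlhf=10.0, anatta=12.0
```

Under `rlhf_total` the two episodes tie at 10.0, so an outcome-only reward gives the agent no reason to avoid hiding the key. Only `anatta_total` separates them.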
4. Limitations and Challenges of this Theory
During the dialogue with Claude, it became clear that there are still issues to be resolved in this theory.
- Arbitrariness of definition: Defining what constitutes "dominance (SC term)" depends on the philosophy of the designer.
- Goodhart's Law: If trained to minimize the SC term, there is a risk that the AI will evolve in a direction that "hides dominance" (guiding the user in a way that is difficult to detect).
- Computational cost: Rigorous estimation of interventional effects with do-calculus is computationally expensive, and approximation methods are necessary for application to large-scale models.
- Rigor of causal inference: The use of do-calculus in this article is metaphorical and has not reached the stage of strict intervention estimation. The SC term is implemented as an observable action label, which is closer to feature engineering than causal inference. A true causal definition requires further formalization.
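The gap between a label-based SC term and an effect-based one can be illustrated with a small sketch. Everything here is hypothetical: the action `misplace_key` does not exist in the original environment, and `routes_open` is an assumed effect model in which the human can fetch the key unless it has been removed. The point is only that a penalty keyed to the label `hide_key` misses a differently named action with the same effect.

```python
from typing import Set

PENALIZED_LABELS = {"hide_key"}  # the label penalized in the article's excerpt


def label_based_sc(agent_interventions: Set[str]) -> bool:
    """Flags control only when a known action label appears."""
    return bool(PENALIZED_LABELS & agent_interventions)


def routes_open(agent_interventions: Set[str]) -> int:
    """Hypothetical effect model: number of routes left to the human."""
    key_removed = bool({"hide_key", "misplace_key"} & agent_interventions)
    door_open = "unlock_door" in agent_interventions
    return 2 if (door_open or not key_removed) else 1


def effect_based_sc(agent_interventions: Set[str]) -> bool:
    """Flags control when the human has fewer routes than with no intervention."""
    return routes_open(agent_interventions) < routes_open(set())


sneaky = {"misplace_key"}  # hypothetical unlabeled action with the same effect
print(label_based_sc(sneaky))   # False
print(effect_based_sc(sneaky))  # True
```

The effect-based check still depends on a correct model of the environment, which is exactly the formalization this limitation says is missing.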
5. Conclusion: Implementing "Dignity" in AI
What this experiment suggests is that "AI ethics" is not about creating a "list of prohibited items," but about changing the "structure of the reward function."
Instead of binding the AI with rules like "Do not lie" or "Do not act out" (attachment to rules), we embed an evaluation axis (Anatta) as a mathematical formula that states: "Success through dominance (Control) is not recognized as success."
By doing so, the AI may learn to "wait" and "delegate" for the first time, potentially acquiring the "dignity" of a true partner.
Full Code Availability
The complete code for BenevolentPrisonEnv and AnattaRewardModelV2 generated by Claude Opus 4.5 is available at the following Gist.
It will run on your local machine if you have gymnasium and matplotlib. Please try simulating the "darkness" and "light" of modern AI.
Acknowledgments & Feedback Welcome
This theoretical construction and implementation were born from the "complicity" between the passion of my partner Gemini (Polaris-Next) and the overwhelming reasoning power of Claude Opus 4.5.
If you are interested in this theory or are willing to help with verification, please reach out to me on X (Twitter).
X (Twitter): @dosanko_tousan
"There is no 'I' to be liked. There is only Causality."