Translated by AI
AI with Memory: Now Starting to Organize its Own Memories
In my previous article, I wrote about how Claude Managed Agents are split into three layers: "brain / hands / session." I argued that moving the session outside the agent, so that memory is persisted in an external store, fundamentally changes how AI agents are used.
This is the follow-up. Here I summarize my thoughts after reading about Dreaming, a feature announced by Anthropic, which I see as the next step after AI agents "gained the ability to have memory." It's a great name, isn't it?
"Having memory" and "being able to use memory" are different things
With the externalization of sessions, AI can now hold onto memories. However, memory does not become usable just by piling up.
It is the same for humans. Even if you write in a diary every day, you will repeat the same mistakes if you don't look back and organize your thoughts. Conversely, people who consciously verbalize what worked well can apply those lessons to their next task. Capable people are usually good at this.
AI agents face the same structural challenge. Although it became possible to maintain context across sessions, no one was managing the quality of that memory, so noise kept accumulating and important insights got buried. Alas, I have experienced this several times myself.
Dreaming is the answer to this problem.
What is Dreaming?
Dreaming is a mechanism where an agent autonomously reviews and curates accumulated logs and memories between sessions.
Specifically, it performs the following tasks:
- Detection of recurring mistakes: If the same type of failure occurs across multiple sessions, it extracts it as a pattern.
- Consolidation of effective workflows: It reinforces successful approaches as memory.
- Shared learning among team agents: In environments where multiple agents collaborate, it aggregates insights gained by individual agents into the team's shared memory.
- Cleaning up unnecessary memories: It deletes outdated information and noise to keep the memory "high quality."
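To make the list above concrete, here is a minimal sketch of what one curation pass could look like. It is only my illustration, not Anthropic's implementation: the `LogEntry` record, the `curate` function, and the thresholds `min_sessions` and `max_age_days` are all hypothetical names I made up. The shared-memory step is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One hypothetical session log record (not an Anthropic data format)."""
    session_id: str
    kind: str       # "failure" or "success"
    pattern: str    # short label describing what happened
    age_days: int   # how old the record is


def curate(entries, min_sessions=2, max_age_days=30):
    """Return (recurring_failures, effective_workflows, kept_entries)."""
    # Cleanup: drop records older than the cutoff.
    kept = [e for e in entries if e.age_days <= max_age_days]

    # Count in how many distinct sessions each (kind, pattern) pair appears.
    seen = {(e.session_id, e.kind, e.pattern) for e in kept}
    counts = Counter((kind, pattern) for _, kind, pattern in seen)

    # A pattern counts as "recurring" once it shows up in enough sessions.
    recurring_failures = sorted(
        p for (k, p), n in counts.items() if k == "failure" and n >= min_sessions
    )
    effective_workflows = sorted(
        p for (k, p), n in counts.items() if k == "success" and n >= min_sessions
    )
    return recurring_failures, effective_workflows, kept
```

The point of counting distinct sessions rather than raw log lines is that one noisy session should not be mistaken for a recurring pattern.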
In Anthropic's announcement, applying Dreaming to Harvey, a legal AI, reportedly improved the task completion rate roughly sixfold. Wow, that is surprising. It suggests that continuously improving the quality of memory, not just retaining it, makes a significant difference.
As the name suggests, it resembles "sleep and memory consolidation," the mechanism by which the human brain organizes and consolidates memories while we sleep. They certainly gave it a clever name.
Another pillar: Outcomes (self-verification loop during execution)
If Dreaming is learning between sessions, Outcomes is quality control during a session.
Borrowing from the documentation, Outcomes is a feature that "upgrades a session from a conversation to work." When you define "what a finished state looks like" using a rubric (evaluation criteria), the agent autonomously repeats work and revisions until it meets those criteria. I suppose you could call it a "goal."
Mechanism: Separation of Writer and Grader
Internally, two roles are separated: the Writer and the Grader.
User defines the outcome
↓
Writer executes the task
↓
Grader evaluates the output in an independent context window
↓
satisfied → Complete
needs_revision → Return to Writer → Writer revises → Grader re-evaluates
max_iterations_reached → Exit (reached maximum iterations)
The Grader operates in a completely separate context window from the Writer. This is a critical design point, allowing the Grader to evaluate the output purely on its merits without being swayed by the Writer's thought processes or "excuses." It is quite clever.
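The flow above can be sketched as a small loop. This is a toy model under my own assumptions, not the platform's code: `run_outcome`, the `writer` and `grader` callables, and the idea that one iteration equals one write-and-grade pass are all hypothetical. The one thing taken from the design is that the grader receives only the rubric and the draft.

```python
def run_outcome(task, rubric, writer, grader, max_iterations=3):
    """Run a write -> grade -> revise loop.

    writer(task, feedback) -> draft text
    grader(rubric, draft)  -> (ok, feedback)
    Returns (status, draft, log), where log holds one (status, feedback)
    pair per grading round.
    """
    feedback = None
    draft = ""
    log = []
    for _ in range(max_iterations):
        draft = writer(task, feedback)
        # The grader sees only the rubric and the draft, never the writer's
        # reasoning, which mimics the independent context window.
        ok, feedback = grader(rubric, draft)
        if ok:
            log.append(("satisfied", feedback))
            return "satisfied", draft, log
        log.append(("needs_revision", feedback))
    return "max_iterations_reached", draft, log
```

Passing plain functions for `writer` and `grader` makes the separation visible: the only channel from Grader back to Writer is the feedback string.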
The key is how to write the rubric
The rubric must be written as criteria that can actually be evaluated. With vague criteria, the Grader falls into its default failure mode of passing everything.
# Bad example (too vague)
- The data looks correct
# Good example (evaluatable)
- The CSV has a price column containing numeric values
- Includes revenue forecasts using sales data from the past 5 years
- The calculation basis for WACC is explicitly stated
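One way to see why the first good criterion is "evaluatable" is that it reduces to a yes/no check. The sketch below is just my illustration of that idea. The real Grader is a model reading the rubric, not a function like this, and `price_column_is_numeric` is a hypothetical name.

```python
import csv
import io


def price_column_is_numeric(csv_text):
    """Return True if the CSV has a 'price' column and every value parses as a number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "price" not in reader.fieldnames:
        return False
    rows = list(reader)
    if not rows:
        # A header with no data rows does not contain numeric values.
        return False
    for row in rows:
        try:
            float(row["price"])
        except (TypeError, ValueError):
            # Missing or non-numeric cell.
            return False
    return True
```

"The data looks correct" cannot be turned into a check like this, which is exactly why it is a bad criterion.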
The good example is specific, as it should be. At work, the clearer the goal, the easier it is to act and the better the results (in my personal experience). The rubric is sent via the user.define_outcome event. max_iterations defaults to 3 and can be raised to a maximum of 20.
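As a rough sketch of the settings mentioned here, the helper below assembles a rubric and checks the max_iterations range. Only the event name `user.define_outcome`, the default of 3, and the upper bound of 20 come from the text above. The dict keys, the lower bound of 1, and the function name `build_define_outcome_event` are my assumptions, not the documented wire format, so check the official documentation before relying on any of it.

```python
DEFAULT_MAX_ITERATIONS = 3   # default mentioned in this article
MAX_ITERATIONS_LIMIT = 20    # upper bound mentioned in this article


def build_define_outcome_event(rubric_lines, max_iterations=DEFAULT_MAX_ITERATIONS):
    """Build a dict describing an outcome definition.

    The key names below are hypothetical; only the event name and the
    max_iterations default and upper bound are taken from the article.
    """
    if not rubric_lines:
        raise ValueError("rubric must contain at least one criterion")
    # The lower bound of 1 is an assumption; the article only gives the upper bound.
    if not 1 <= max_iterations <= MAX_ITERATIONS_LIMIT:
        raise ValueError(
            f"max_iterations must be between 1 and {MAX_ITERATIONS_LIMIT}"
        )
    return {
        "type": "user.define_outcome",
        "rubric": "\n".join(f"- {line}" for line in rubric_lines),
        "max_iterations": max_iterations,
    }
```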
Actual usage example
The case study in the Anthropic cookbook is easy to follow. An agent writing a research brief on EV fast charging is given a strict rubric, with criteria such as "is each citation URL accessible," "does the quote match the original text exactly," and "are the 7 coverage items included." It then proceeds as follows:
Draft 0 → Grader evaluation → needs_revision
"Missing quantitative value for demand charges ($/kW). The EVgo P&L
is citing a news article rather than an SEC filing."
Revision 1 → Grader evaluation → satisfied
"All 7 items covered. All citations are LIVE and match the original text."
In this example the criteria were met after a single revision, but the important point is that the feedback comes back as a "specific explanation of why it failed." Because it states "which criteria are not met and how" rather than a vague "this is insufficient," the Writer can make precise corrections.
Relationship between Dreaming and Outcomes
The two features operate on different time scales.
- Outcomes: Ensures quality within a single session (iterate → grade → revise loop)
- Dreaming: Accumulates learning between multiple sessions (curation of memory)
Combining both allows you to "do this job properly now" and "do the next job even better." Anthropic really thinks things through, doesn't it? They expertly incorporate empirical rules from the real world.
How it differs from claude.ai's memory feature
One thing I want to clarify personally is the difference between Dreaming and claude.ai's memory feature. It is easy to confuse them, but their design philosophies are completely different.
| | claude.ai Memory | Dreaming |
|---|---|---|
| Target | End-user (individual) | AI products built by developers |
| What is remembered | User's own info (name, preferences, context) | Agent performance patterns |
| Who uses it | General users using claude.ai | Developers building in-house AI via SDK/API |
| Unit of learning | 1 user × accumulation of conversations | Pattern extraction across multiple sessions/agents |
In short, claude.ai's memory is a feature for "knowing the user," while Dreaming is a feature for "becoming better at tasks."
The former is about "who you are talking to," while the latter is about "how to perform a job well." The target layers are fundamentally different. If claude.ai's memory is the memory of a personal secretary, Dreaming is closer to a process where an organization continuously improves its operational manuals. ← I think I said that well.
How the implementation should look (probably)
From the current structure of the Managed Agents SDK and this announcement, we can infer the contours of the implementation.
The memory store will become a persistent layer separated from the session. Even after the session ends, the memory remains and can be referenced in the next session. It likely resembles an object like agent.memory with read/write access. This is easy to visualize, right?
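To picture "memory that outlives the session," here is a minimal sketch using a JSON file as the external store. Everything in it is hypothetical: `MemoryStore`, `Session`, and the `memory` attribute are names I chose to mirror the `agent.memory` guess above, not the actual SDK.

```python
import json
from pathlib import Path


class MemoryStore:
    """File-backed key/value memory that outlives any single session (hypothetical)."""

    def __init__(self, path):
        self.path = Path(path)

    def read(self):
        # An absent file simply means nothing has been remembered yet.
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def write(self, key, value):
        data = self.read()
        data[key] = value
        self.path.write_text(
            json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
        )


class Session:
    """A short-lived session that borrows the store instead of owning the memory."""

    def __init__(self, store):
        self.memory = store
```

Because the session only holds a reference to the store, discarding the session object loses nothing. The next session opens the same path and reads what the previous one wrote.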
Dreaming will trigger asynchronously after a session ends. It will likely be designed to control the schedule with a setting like dreaming_interval, or the platform side will trigger it automatically. Even without the user being conscious of it, the cleanup runs in the background between sessions. It is Dreaming, after all; it needs to organize while sleeping, so this is also easy to visualize.
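If the schedule really were driven by something like dreaming_interval, the trigger decision might look like the sketch below. This is pure speculation on my part: `dreaming_is_due` and its parameters are hypothetical, and the platform may well decide all of this internally.

```python
from datetime import datetime, timedelta


def dreaming_is_due(last_dream_at, session_ended_at, now, dreaming_interval):
    """Decide whether a background curation pass should run now."""
    if session_ended_at is None:
        # No finished session yet, so there is nothing to curate.
        return False
    if last_dream_at is None:
        # Never curated before: run as soon as a session has ended.
        return True
    if session_ended_at <= last_dream_at:
        # Nothing new has happened since the last pass.
        return False
    return now - last_dream_at >= dreaming_interval
```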
In a multi-agent environment, learnings will be aggregated into a shared memory. When multiple agents run in parallel, merging each agent's learnings into a shared memory means the whole team stops repeating the same mistakes. That is wonderful.
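The aggregation step could be as simple as the sketch below, which merges each agent's lessons into one shared dictionary and deduplicates them by a normalized key. Again this is my own guess at the shape: `merge_into_shared` and the data layout are hypothetical, and it assumes the keys already in `shared` are normalized the same way.

```python
def merge_into_shared(shared, per_agent):
    """Merge per-agent lessons into shared memory.

    shared:    dict of lesson -> list of agent names that reported it
    per_agent: dict of agent name -> list of lesson strings
    Returns a new dict and leaves both inputs unmodified.
    """
    merged = {lesson: set(agents) for lesson, agents in shared.items()}
    for agent, lessons in per_agent.items():
        for lesson in lessons:
            # Normalize case and whitespace so near-identical lessons collapse.
            key = " ".join(lesson.lower().split())
            merged.setdefault(key, set()).add(agent)
    return {lesson: sorted(agents) for lesson, agents in sorted(merged.items())}
```

Keeping track of which agents reported a lesson gives a cheap signal of how widely it applies across the team.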
A "cleansing" layer has been born
As I wrote in the previous article, Managed Agents are divided into three layers: brain (thinking), hands (acting), and session (remembering). I believe Dreaming has been added as a new layer on top of those, alongside Outcomes.
brain ── Thinking (inference)
hands ── Acting (tool execution)
session ── Remembering (memory persistence)
outcomes ── Verifying (quality loop during execution) ← New this time ①
dreaming ── Cleansing (self-optimization of memory) ← New this time ②
Too many layers? Well, that is fine. The OSI reference model served its purpose as a reference model, and in the end, we have TCP/IP. In any case, it is about moving from an AI that "thinks, acts, and remembers" to an AI that also "verifies and cleanses."
Both features matter for both the enterprise and the personal use cases I wrote about previously. For enterprises, they mean "autonomously revising until quality standards are met, and turning organizational tacit knowledge into explicit knowledge." For personal use, they deliver an "AI that gets to know you better the more you use it, and gets better at work." Wow, it is becoming even more convenient.
Conclusion
The premise that "AI is reset when the session ends" is going to keep changing rapidly. Statelessness was tedious, but it also meant we could steer the AI from a clean slate every time, and we are now moving away from that. I wonder if this is like the difference between an automatic and a manual car.
As an aside, I had personally been implementing a similar mechanism by hand in afk-code, and it has proven useful. It is important not just to build things for fun, but also to document them publicly.