What are Claude Managed Agents? Updating my perspective on production-ready AI agents after CwC2026 London

Introduction

Recently, I have been working through the sessions from "Code with Claude" in London a little at a time each evening. Digesting one or two videos a day over a cup of coffee has become a habit over the past few weeks.

For me, working as an AI Solutions Architect, official talks from Anthropic are undoubtedly a treasure trove of primary information. They reveal what product launch keynotes do not show: what the engineers on the ground actually thought while designing, where they got stuck, and what they learned, all told in the speakers' own words.

Among them, the session "Ship your first Managed Agent" by Isabella He (Member of Technical Staff, Anthropic Applied AI team) was the one that particularly helped organize my thoughts recently.

https://www.youtube.com/watch?v=19HDQ9HppOA

Session title slide "Ship your first Managed Agent" by Isabella He, Anthropic

This session is a hands-on guide that teaches how to set up a production-ready agent in the shortest possible time using Claude Managed Agents (CMA), a managed harness for which Anthropic released a public beta in April 2026.

Isabella He conducting a hands-on session at Code w/ Claude

I think about taking agents to production on a daily basis, and while listening to Isabella there were many moments when my own mental model finally became clear along several axes. I would like to record my takeaways in my own words.


Flow of this article

  1. The weight of the word "harness" — The evolution from Messages API to CMA and the philosophy behind it
  2. The three resources that make up CMA — Categorization as agents / environments / sessions
  3. Design decisions to decouple "brains" and "hands" — A move that simultaneously benefits security, latency, and fault tolerance
  4. Sessions speak in "events" — Moving beyond request/response
  5. The hands-on feel — The lightness of prompts and the effectiveness of streaming
  6. The landscape ahead — Advanced features like Sub-agents, Memory, Outcomes, and Vaults
  7. What I want to try starting tomorrow — One focused takeaway
  8. Conclusion — Re-applying the phrase "harnesses should evolve alongside your agents" to my own work

For reference, the session agenda was a bit simpler, consisting of three parts: Core concepts, Hands-on workshop, and Beyond the basics.

Session agenda slide: Three-part structure of Core concepts / Hands-on workshop / Beyond the basics

My article is based on that structure but divided into finer sections tailored to my field experience.


1. The weight of the word "harness" — The path from Messages API to CMA

Here, I will organize why the word "harness" is important, using the "history so far" that Isabella mentioned at the beginning as an entry point.

First of all, I use the term "harness" to mean the complete set of underpinnings needed to actually run a model: agent loops, tool execution, state management, retries, observability, scaling, and sandboxing. Anyone who has stepped into agent development must have struggled with the weight of these underpinnings at least once.

Isabella organized the programmatic usage of Claude into three stages: Messages API → Agent SDK → Claude Managed Agents. As I rearranged them in my head while listening, I felt this could be read as a history where "the range that developers have to build themselves is gradually being reduced."

During the session, a slide appeared showing these three stages side-by-side. It visually traces how the "YOU MANAGE" (range you do yourself) and "ANTHROPIC PROVIDES" (range Anthropic provides) change in each generation.

Evolution chart showing Messages API / Agent SDK / Managed Agents. A structure where the "range you manage" shrinks and the "range Anthropic provides" expands with each generation

Stage 1 (2023–): Messages API. This is the raw API: you send tokens and get tokens back. Because it is so primitive, the freedom is very high; however, the agent loop, tool execution, context management, compaction, observability, and scaling (all the underpinnings) had to be built by oneself. Anyone running something agent-like during this period must have written a substantial amount of "code supporting the LLM."

Stage 2: Agent SDK. Here, the horizon expanded one step further. You get a powerful base where Claude Code is provided as a harness that can be called programmatically, allowing Claude to take actual actions on a computer or file system. While the SDK took on major parts of the agent loop and tool execution that were burdensome in the first stage, the responsibility for hosting, scaling, and sandbox management remained with the developers.

Stage 3: Claude Managed Agents (CMA). This is the star of the session. It is the first harness that brings the entire underpinnings of agent execution, including the "hosting, scaling, and sandboxes" that remained in the second stage, into Anthropic's managed infrastructure. Developers can focus on the parts where domain knowledge and customizability are truly needed: task design, agent configuration, and custom tool logic.

Looking at them lined up like this, the structure becomes clear: "The 'code supporting the LLM' that I built myself during the Messages API era has been gradually absorbed by the harness side with each generation." I wasn't consciously using these three stages differently in my daily work. However, as someone who has witnessed the scenario of "having to completely rebuild the underpinnings of a PoC-built agent half a year later" several times, I had a strong sense of the problem of "maintenance cost of self-made harnesses." The sentence that hit hardest in this session faced this problem head-on.

"Harnesses should evolve alongside your agents."
— Isabella He

My thoughts here: This sentence hit me so hard that I want to stick it on my desk for a while. A change of model generation alone can strip the meaning from prompts, tool designs, and completion logic that you thought you had optimized. She herself told a vivid episode about how she incorporated a workaround into the harness to deal with Sonnet 4.5's "context anxiety" (the behavior of finishing tasks early even when there is capacity), but when Opus 4.5 appeared, that behavior disappeared and the workaround became unnecessary. Any organization with an in-house harness can run into this.

An analogy to grasp the overall structure

At this point, I would like to offer an analogy that supported the big picture in my head. When trying to understand CMA, I found it satisfying to think in terms of comparing "starting a restaurant as an individual from scratch vs. joining a franchise."

  • Messages API: An individual restaurant where you prepare the kitchen, kitchen equipment, and fixtures yourself, and handle everything including hygiene management, inventory management, POS introduction, and staff training. You have tremendous freedom, but there is an overwhelming amount of work to do before opening the store.
  • Agent SDK: A form where the supply network and some recipes are provided by the headquarters, but you handle store operation (choosing the location, interior, front-of-house service) yourself. It is much easier, but the responsibility for store operation lies with you.
  • CMA: A full package where the headquarters provides everything including the "store operation framework"—POS, logistics, refrigeration equipment, cleaning manuals, hygiene audits, and staff training. Owners can focus on "what makes their store unique—assortment, customer service policy, and local characteristics."

This analogy leaves some things out. Just as franchises are constrained by the headquarters' intentions, CMA also comes with a dependence on vendor-specific services. Even so, I feel the analogy is effective enough for understanding how the layers are divided in terms of "who holds the underpinnings."

:::
💡 Summary of the chapter: Messages API → Agent SDK → CMA

  • The evolution of the programmatic usage of Claude is easy to understand if viewed as a flow from "high-freedom raw primitives" to "managed production-ready foundations."
  • Harnesses must be kept evolving alongside model generations, and the maintenance cost is heavier than imagined.
  • CMA is a service aimed at shifting that maintenance cost to Anthropic and allowing developers to focus on domain logic.
    :::

Isabella also brought up another impressive number.

Slide titled "The fastest way to build and ship transformative agentic products — 10–15× faster to production"

"We've seen people build 10 to 15 times faster to production with Claude Managed Agents by leveraging our purpose-built harness."
— Isabella He

My thoughts here: This 10–15x figure is, of course, an observation by Anthropic and there is no guarantee it can be reproduced exactly in your own environment. However, when the "last mile of production" (hosting, scaling, fault-tolerant design, session persistence) is abstracted, the change in shipping speed is something I can understand on a visceral level. I myself have experienced a project turning into an "infrastructure project" the moment I started building an event-sourcing-based logging foundation while trying to take agent recoverability seriously. Remembering the pain of that time, I honestly feel that the option of offloading the harness layer to the vendor can indeed create a several-fold speed difference.

However, it is not yet clear whether this will apply directly to my own field. Choosing managed services is also a trade-off with freedom, reversibility, and data boundary constraints. I would like to touch upon this again in the latter part of this article.


2. The three resources that make up CMA — agents / environments / sessions

Here, I will organize the three most basic resources used when building something on top of CMA.

Isabella's explanation was as follows.

Three primary resources: A slide comparing the three: agents(/v1/agents), environments(/v1/environments), and sessions(/v1/sessions)

  1. agents — The place to define "persona and capabilities." It is the blueprint of the "brain" side that the agent can refer to within the agent loop, such as the model used, system prompts, tools, MCP servers utilized, and skills.
  2. environments — The "place" where the agent actually takes action. It is the space, in the execution-environment sense, where the agent's "hands" move, consisting of a combination of containers and network settings.
  3. sessions — A unit that connects agents and environments. A session for "a specific agent instance on a specific environment" is launched, serving as a container for interactions with the user.
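
The relationship between the three resources can be sketched as plain data structures. This is a local model for illustration only and does not call the CMA API: the class names, field names, ID format, and the model string are all assumptions, chosen to mirror the roles described above (blueprint, place of execution, and the binding between the two).

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical local model of the three resources; this is NOT the CMA SDK.
_ids = count(1)


def _new_id(prefix: str) -> str:
    return f"{prefix}_{next(_ids)}"


@dataclass(frozen=True)
class Agent:
    """Blueprint: persona and capabilities (the "brain" side)."""
    model: str
    system_prompt: str
    tools: tuple[str, ...] = ()
    id: str = field(default_factory=lambda: _new_id("agent"))


@dataclass(frozen=True)
class Environment:
    """Where the "hands" act: container plus network settings."""
    allowed_hosts: tuple[str, ...] = ()
    id: str = field(default_factory=lambda: _new_id("env"))


@dataclass
class Session:
    """Binds one agent to one environment and holds the event log."""
    agent_id: str
    environment_id: str
    events: list[dict] = field(default_factory=list)
    id: str = field(default_factory=lambda: _new_id("session"))


agent = Agent(model="example-model", system_prompt="You are an SRE agent.",
              tools=("get_metrics",))
env = Environment(allowed_hosts=("api.example.com",))

# One blueprint and one environment can back many independent sessions.
s1 = Session(agent_id=agent.id, environment_id=env.id)
s2 = Session(agent_id=agent.id, environment_id=env.id)
s1.events.append({"type": "user.message", "text": "Why is latency up?"})
```

The point of the sketch is the direction of the references: a session points at an agent and an environment by ID, while each session keeps its own event log.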

Here, it is easy to organize if I continue the analogy of the "franchise restaurant."

  • agents = Store concept and recipe collection: "What kind of store is ours, and what are we selling?"
  • environments = The kitchen and dining area where food is actually served.
  • sessions = The log of a series of interactions from entry to payment for each group of customers.

If I translate this more toward technology, in my field experience, it feels close to the correspondence in the Kubernetes world.

  • agents ≒ Deployment (Blueprint)
  • environments ≒ Pod (Actual entity)
  • sessions ≒ Log stream of interactions that occurred on that Pod

Thinking in terms of this correspondence, scaling designs such as "launching multiple environments from the same agent blueprint" or "running multiple sessions on the same environment" can be organized with good visibility. From an infrastructure engineer's perspective, this is a structure that is quite easy to handle even as material for multi-tenant design. Familiar patterns naturally hold true, such as "isolating environments per user to draw security boundaries" or "designing sessions to be disposable while keeping only logs."

Isabella included another important point. It is the premise, subtle but effective, that "the agent loop runs server-side" in the architecture.

Architecture: The agent loop runs server-side. "You send events. You stream events back. The loop runs reliably whether your client is connected or not."

"A key thing here, as I alluded to briefly before, claude managed agent has the agent loop run server side. This means that a lot of the complexities that come with managing hosting and scaling are abstracted away."
— Isabella He

My thoughts here: The statement "the agent loop runs server-side" is simple enough to let slip by, but I read it as one that shifts the premises of operational design quite significantly. Until now we had to work hard on how to maintain user sessions on the web app side, and agents add a further layer on top of that: session persistence on the agent side. I understood that CMA now takes care of that layer.

:::
💡 Chapter Summary: Roles of the three resources

  • agents = Blueprints (Model, Prompts, Tools)
  • environments = Execution containers (including network settings)
  • sessions = A single interaction occurring on an agent × environment pair = Event log
  • Since the agent loop runs server-side, the state is maintained even if you close your laptop
    :::

3. Separating the "brain" from the "hands" — The essence of design decisions

Here, I will organize the design decision at the core of the CMA architecture: the "separation of the agent loop (brain) and tool execution (hands)." For me, this was the part of the session that made me think the most.

Breaking down her explanation:

In many agent harnesses to date, the agent loop (brain) and tool execution (hands) were tightly coupled. This was the style where Claude would both do the thinking and trigger the tools within the same container. This design still makes sense in cases where you want the agent to behave freely on a file system, such as Claude Code.

However, depending on the type of agent, scenarios arise where you "want to separate the brain from the hands." For example:

  • From the perspective of credentials and security. If you separate the brain from the hands, you can design the agent so that it "only accesses actual credentials in an encrypted state."
  • If you want to clarify the sandbox boundary. By separating the brain side, the "layer with side effects" can be cleanly localized.
  • Latency. When they resided in the same container, the container had to be started every time a session was launched, which degraded TTFT (time to first token).

During the session, a slide titled "The brain left the box" was shown to illustrate this design decision. In the "BEFORE" on the left, the agent loop and tool execution were packed into the same box, and one was launched for every session. In the "NOW" on the right, the brain (agent loop) runs in a service managed by Anthropic, and the hands (sandbox) are only provisioned when tools are truly needed.

BEFORE vs NOW comparison. BEFORE was "ONE CONTAINER PER SESSION" with brain and hands in the same box. NOW is a structure where the brain is an Anthropic-managed resident service, and hands (Sandbox) are provisioned only when necessary
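
Why decoupling takes container startup off the critical path can be shown with a toy timing model. This is a simulation under stated assumptions, not CMA's implementation: the class names are hypothetical, the cold start is faked with `time.sleep`, and the "first token" is a constant string.

```python
import time


class Sandbox:
    """Stand-in for the "hands": expensive to start (simulated with sleep)."""

    def __init__(self, startup_seconds: float):
        time.sleep(startup_seconds)  # simulated container cold start

    def run(self, command: str) -> str:
        return f"ran: {command}"


class CoupledSession:
    """BEFORE: the container starts before the brain can answer at all."""

    def __init__(self, startup_seconds: float):
        self.sandbox = Sandbox(startup_seconds)

    def first_token(self) -> str:
        return "Looking"


class DecoupledSession:
    """NOW: the brain answers immediately; hands are provisioned on demand."""

    def __init__(self, startup_seconds: float):
        self._startup_seconds = startup_seconds
        self._sandbox = None

    def first_token(self) -> str:
        return "Looking"

    def use_tool(self, command: str) -> str:
        if self._sandbox is None:  # pay the startup cost only on first tool use
            self._sandbox = Sandbox(self._startup_seconds)
        return self._sandbox.run(command)


def time_to_first_token(session_cls, startup_seconds: float) -> float:
    """Measure the wall-clock time from session creation to the first token."""
    start = time.perf_counter()
    session_cls(startup_seconds).first_token()
    return time.perf_counter() - start
```

Calling `time_to_first_token(CoupledSession, 0.05)` includes the simulated cold start, while the decoupled variant does not. The startup cost has not disappeared; it has moved to the first tool call, and sessions that never need a tool never pay it.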

And Isabella mentioned an impressive number here.

"Our teams actually saw reductions in time to first token along the lines of over 90% reduction in TTFT for our P95 metrics on latency."
— Isabella He

My thoughts here: This 90% figure also needs to be read in context. It is an observed value under specific conditions within Anthropic, and it would be premature to expect the same numbers in your own field. That said, engineers who have struggled with TTFT in serverless designs will intuitively recognize the effect of avoiding cold starts. Separating the brain from the hands removes the container startup cost of the hands from the critical path of session launch, which is a structural improvement.

And for me, the most symbolic phrase in this session was the following.

"You essentially want to be able to decouple the hands from the brains of the agents."
— Isabella He

My thoughts here: I had a premonition that this phrase would become a slogan for agent design going forward. Having made "separation of concerns in microservices" a pillar of my work, I strongly sympathize with the idea of "loosely coupling LLM judgments from execution systems that carry side effects." As LLMs become increasingly powerful, I believe designing side-effect boundaries will become even more important.

My organization bridging this with existing knowledge

If I take a step and align this with the concepts of the cloud/Kubernetes world I know, it becomes organized like this:

  • Separation of "brain" and "hands" ≒ Separation of control plane and data plane
  • Division of responsibilities between 'stateful services' and 'services with side effects' in microservices
  • Separation of logic layer and I/O layer in serverless design

In other words, I felt that the separation architecture presented by CMA is not an entirely new idea, but rather a careful application of design philosophies widely accepted in the world of distributed systems to the agent domain. There is not much need to be wary of new terms; rather, finding that one can pull tools from existing drawers, thinking "this is something I know," was a welcome discovery for me personally.

Of course, this is just how I take it in, and there are other ways to organize it. For example, those who hold that "the agent's intelligence is better utilized if Claude can think and trigger tools in the same place" still have room to defend the merits of tight coupling. This can only be decided case by case, depending on the use case.

:::
💡 Chapter Summary: Three perspectives where brain-hand separation is simultaneously effective

  • Security: Credentials and sandbox boundaries can be decoupled from the brain side
  • Latency: Startup of the hands-side container can be removed from the critical path of session launch (TTFT reduction)
  • Fault Tolerance: Even if the hands-side container breaks, the brain side stays alive separately, making it easier to resume the session
    :::

Furthermore, the scaling design behind this "separation of brain and hands" is also explored from the same perspective in Anthropic's official engineering blog. The post was written by Lance Martin, Gabe Cemaj, and Michael Cohen, was published on the same day as the CMA announcement (April 8, 2026), and organizes the design decisions and operational considerations from a different angle than the hands-on session. If this section of the video resonated with you, reading the post alongside it will give you a more rounded understanding.

https://www.anthropic.com/engineering/managed-agents


4. Sessions speak in events — Moving beyond request/response

Here, I will organize how CMA represents sessions, a point that Isabella touched on lightly in the middle of the workshop but which proves significant later on.

CMA sessions do not move in a "tokens in, tokens out" model; they operate in units of events. Specifically:

  • User messages
  • Tool calls by the agent
  • Responses from the agent
  • (Internal) State transitions

These are all appended to the session log as events.

The slide "Sessions speak in events, not request/response" shown during the session conveyed this intuitively. It displays "Events you send" (user.message, user.custom_tool_result, etc.) on the left, and "Events you receive" (agent.message, agent.tool_use, session.status_idle, etc.) on the right, showing that the entire interaction with the agent is represented as a sequence of events with a {domain}.{action} naming convention.

Sessions speak in events, not request/response slide. Left: "Events you send" (user.message, user.custom_tool_result, user.tool_confirmation, user.interrupt) / Right: "Events you receive" (agent.message, agent.tool_use, agent.custom_tool_use, agent.mcp_tool_use, session.status_idle, session.error)

"The key portion here is that when our Claude Managed Agents runs within a single session, instead of responding in tokens in and tokens out, it actually works in units of events."
— Isabella He

My thoughts here: The idea that "sessions speak in events" overlapped almost entirely with Event Sourcing in my mind. In a request/response model, it is very difficult to "reproduce the state at a certain point in the past," "resume from halfway through," or "reproduce the same symptoms." However, with an event log model, you can reproduce past sessions by "replaying" them. This fundamentally raises the resolution for debugging, auditing, and failure analysis in production operations.
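
The replay idea can be sketched in a few lines: the session state is never stored directly but is recomputed by folding over the event log. The event type names follow the slide; the payload fields (`text`, `name`, `content`) and the `SessionView` shape are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionView:
    """State derived purely from the event log (nothing stored separately)."""
    transcript: list = field(default_factory=list)
    tool_calls: int = 0


def apply(view: SessionView, event: dict) -> SessionView:
    """Fold one event into the view, keyed on the {domain}.{action} name."""
    domain, _, action = event["type"].partition(".")
    if action == "message":
        view.transcript.append(f"{domain}: {event['text']}")
    elif action.endswith("tool_use"):
        view.tool_calls += 1
    return view


def replay(events: list, upto: Optional[int] = None) -> SessionView:
    """Rebuild the state after the first `upto` events (all events if None)."""
    view = SessionView()
    for event in events[:upto]:
        apply(view, event)
    return view


log = [
    {"type": "user.message", "text": "Why did checkout latency spike?"},
    {"type": "agent.custom_tool_use", "name": "get_metrics"},
    {"type": "user.custom_tool_result", "content": "p95=2.3s"},
    {"type": "agent.message", "text": "A deploy at 14:02 looks related."},
]
```

Because `replay(log, upto=n)` reconstructs the session as it was after any prefix of the log, "what did the agent know at step n" becomes a function call rather than guesswork.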

An analogy that works well here is "a replay of a recorded match." A request/response agent is like "asking a player a question on the field during a match"; once it is over, the information vanishes. On the other hand, an event-driven session is like "recording the entire match so you can rewind and watch it as many times as you like later." If there is a critical play, you can verify it again and again. In production agent operations, this difference is quite significant.

The "state" held by a session

She also touched on the fact that a session has four representative states:

  • idle (waiting)
  • running (executing)
  • rescheduling (scheduling retry)
  • terminated (ended)

What matters is that these are not just a list of state names but are explicitly defined as a state machine. The session starts at idle, transitions to running when a user.* event arrives, passes through rescheduling to retry when a transient error occurs, and ends at terminated when the error is unrecoverable. These transitions are handled entirely by the harness.

Four session statuses state transition diagram. idle -> user.* event -> running -> (with transient error) rescheduling -> auto-retry back to running, or transition to terminated with an unrecoverable error
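
The diagram can be written down as a transition table. The four status names come from the session; the trigger names are illustrative labels, and the running-to-idle transition on turn completion is an assumption inferred from the existence of the session.status_idle event rather than something stated explicitly.

```python
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    RUNNING = "running"
    RESCHEDULING = "rescheduling"
    TERMINATED = "terminated"


# (current status, trigger) -> next status. Trigger names are illustrative.
TRANSITIONS = {
    (Status.IDLE, "user_event"): Status.RUNNING,
    (Status.RUNNING, "turn_complete"): Status.IDLE,  # assumption, see lead-in
    (Status.RUNNING, "transient_error"): Status.RESCHEDULING,
    (Status.RESCHEDULING, "auto_retry"): Status.RUNNING,
    (Status.RUNNING, "unrecoverable_error"): Status.TERMINATED,
}


def next_status(current: Status, trigger: str) -> Status:
    """Return the next status, or raise if the transition is not defined."""
    try:
        return TRANSITIONS[(current, trigger)]
    except KeyError:
        raise ValueError(
            f"no transition from {current.value} on {trigger}"
        ) from None


def run(triggers: list, start: Status = Status.IDLE) -> Status:
    """Apply a sequence of triggers and return the final status."""
    status = start
    for trigger in triggers:
        status = next_status(status, trigger)
    return status
```

Making undefined transitions an error is the useful property here: a terminated session cannot silently start running again.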

Because the harness takes care of these state transitions, integrations such as restarting sessions from external events using Webhooks or waking up agents in specific states can be realized without additional design costs. This means that agents are not just "triggered because a user spoke to them," but can "start moving on their own triggered by monitoring alerts or business events."

Organization based on my field experience

In my field experience, the three points I have been most troubled by in production agent operations are:

  1. Not knowing "when, what, and in what order things happened."
  2. Inability to reproduce the same symptoms.
  3. Inability to reconcile user subjective reports ("a strange answer came back") with actual behavior.

The fact that CMA is event-log-driven means that the resolution for these three points increases structurally. For infrastructure engineers, I understood this to mean that you can visualize the inside of an agent with the same feeling as "visualizing the inside of an app using distributed tracing or structured logging."

As a subtle but happy side effect, there is the nature that "the conversation remains in the cloud." Isabella's slide showed this — the session remains even after a hard refresh because you are not writing the database yourself. It is expressed via a simple API where client.beta.sessions.list(agent_id=...) retrieves a list of past sessions and client.beta.sessions.events.list(session_id) allows for a full replay.

The conversation lives in the cloud slide. Code example showing that all sessions can be retrieved with client.beta.sessions.list(agent_id=...) and a full replay can be retrieved with client.beta.sessions.events.list(session_id)

:::
💡 Chapter Summary: The power of session = event logs

  • User messages, tool calls, and agent responses are all eventized
  • States (idle/running/rescheduling/terminated) are automatically managed by the harness
  • By combining with Webhooks, agents can be started via external event triggers
  • The resolution for resuming, reproduction, auditing, and failure analysis is elevated by one level
    :::

5. Feeling the reality through hands-on experience

Here, I will organize what I took in at a gut level, not just through numbers or words, while watching the hands-on session that Isabella ran.

In the middle of the session, Isabella called out to the participants, "Open your laptops and head to the link in the repository," and moved into the main hands-on part.

Part two: Hands-on workshop. Slide showing "Laptops open — head to the link below." and the workshop link (cwc26.short.gy/workshops)

The setup itself was truly simple: git clone -> venv -> pip install -> add ANTHROPIC_API_KEY to .env -> streamlit run app.py and it was done. This alone brought up the prototype UI for an "incident response agent" in my local browser.

Hands-on setup commands. git clone, cd cwc-workshops/ship-your-first-managed-agent, python venv creation, pip install -r requirements.txt, ANTHROPIC_API_KEY configuration, and streamlit run app.py

The source code for this hands-on is publicly available in the ship-your-first-managed-agent directory of the official Anthropic GitHub repository anthropics/cwc-workshops. Since you can run it locally immediately as long as you have an ANTHROPIC_API_KEY, I encourage those interested to try it out on your own devices. If you launch the same screen while reading this article, the event stream and "brain-hand decoupling" described below will feel much more tangible.

https://github.com/anthropics/cwc-workshops/tree/main/ship-your-first-managed-agent

The flow of the hands-on was simple and structured to be completed by copying and pasting:

  1. Defining the agent — Specify Claude Opus 4.7, and provide the system prompt and tool definitions.
  2. Defining the environment — Set network allow lists on the Anthropic Cloud.
  3. Attaching logs (Files API) — Pass metrics and logs to the agent as files.
  4. Defining sessions and streaming — Link agent ID × environment ID × resources.
  5. Connecting local tools and session deletion — Link the actual logic of tools like get_metrics to the local environment.

Looking at this sequence, there are a few things I felt in my gut.

Gut feeling 1: Prompts can be simpler than you think

The system prompt Isabella showed was truly simple. Something like this was enough: "You are an SRE agent in charge of incident debugging. You can use tools such as get_metrics, get_recent_deploys, and get_diff."

"The agent is the one that defines the persona and the capabilities of the agent here, so it's the model, the system prompt, and the tools in our case for our agent here."
— Isabella He

My thoughts here: I have always tried to design with the mindset that "if you pack too much into a prompt, it will hold you back when the model generation changes." Honestly, though, that judgment lived only in my own head, and I had not distilled it into a simple principle like "personas and prohibitions in the prompt, push all capabilities to the tools side." Watching Isabella's hands-on, I felt that what I had been trying to do was backed by an example saying "this level of simplicity is fine," which helped clarify my thinking further.

I knew in my head that the quality of tool definitions (names, descriptions, and argument explanations) functions as de facto "API documentation." Seeing a "simple prompt + carefully written tool definitions" combination actually work in the hands-on made me realize again that "this is where I should invest my time."
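
As an illustration of a tool definition treated as documentation, here is a sketch in the name / description / input_schema shape used for Messages API tools. Whether CMA custom tools use exactly this shape is an assumption, and the argument names, descriptions, and the `missing_docs` helper are hypothetical.

```python
# Illustrative tool definition; the argument names are hypothetical and the
# overall shape is assumed to match Messages API tool definitions.
GET_METRICS_TOOL = {
    "name": "get_metrics",
    "description": (
        "Return time-series metrics for one service. Use this first when "
        "investigating latency or error-rate incidents."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "service": {
                "type": "string",
                "description": "Service name, e.g. 'checkout'.",
            },
            "window_minutes": {
                "type": "integer",
                "description": "How far back to look, in minutes.",
            },
        },
        "required": ["service"],
    },
}


def missing_docs(tool: dict) -> list:
    """List the parts of a tool definition that lack a description."""
    problems = []
    if not tool.get("description"):
        problems.append("tool description")
    for name, spec in tool["input_schema"]["properties"].items():
        if not spec.get("description"):
            problems.append(f"argument '{name}'")
    return problems
```

A check like `missing_docs` is small enough to run in CI, which is one way to make "invest time in tool definitions" an enforced habit rather than a good intention.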

Gut feeling 2: Streaming is a part of "observability"

When you run the agent, events begin to flow from the moment the user sends a prompt. You can see in real-time on the UI what tools Claude is calling, what it is thinking, and how far it has progressed.

I felt that my previous understanding of streaming being just a "UX topic" was only half correct. In the CMA design, streaming is not only a UX topic but also a topic of observability. The same events flow into Anthropic's monitoring console, and both the user's experiential perspective and the operator's visualization perspective derive from the same event logs. I thought this was a clean design.

Here, what personally made me think "I see" was the implementation of the mechanism where "the agent on the cloud calls a function on your laptop."

The cloud agent calls a function on your laptop slide. The cloud agent loop sends an agent.custom_tool_use event via an SSE stream, and the local script handles it with handle_tool(name, args) and returns a POST with user.custom_tool_result. Tool schema and handle_tool() code examples are also listed.

Normally, "calling a laptop from the cloud" would require inbound network access, but CMA solves this by having "the laptop keep an SSE stream open, and requests from the cloud flow backwards through that pipe." No additional network settings are required on the user side. Moreover, since the tool wire protocol stays the same, the part that the demo handles by reading data/metrics.json with json.load can be moved to production by simply replacing it with a Datadog SDK call.
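
The round trip can be simulated locally without any SDK or network. In this sketch the SSE stream is replaced by a plain list of dicts and `send` by a callback; the event type names follow the slides, while the payload fields (`id`, `name`, `input`, `tool_use_id`, `content`) and the metric values are assumptions for illustration, not the actual wire format.

```python
import json


def get_metrics(service: str) -> dict:
    # Demo stand-in; in production a real metrics client call would go here.
    return {"service": service, "p95_latency_ms": 2300}


HANDLERS = {"get_metrics": get_metrics}


def handle_tool(name: str, args: dict) -> str:
    """Run the local function for a tool call and return a JSON string."""
    if name not in HANDLERS:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(HANDLERS[name](**args))


def answer_tool_calls(stream, send) -> None:
    """Consume incoming events; reply to each custom tool call via send()."""
    for event in stream:
        if event["type"] == "agent.custom_tool_use":
            send({
                "type": "user.custom_tool_result",
                "tool_use_id": event["id"],
                "content": handle_tool(event["name"], event["input"]),
            })
        elif event["type"] == "session.status_idle":
            break  # the agent has finished this turn


# Simulated stream standing in for the open SSE connection.
fake_stream = [
    {"type": "agent.custom_tool_use", "id": "call_1",
     "name": "get_metrics", "input": {"service": "checkout"}},
    {"type": "agent.message", "text": "p95 latency is elevated."},
    {"type": "session.status_idle"},
]
sent = []
answer_tool_calls(fake_stream, sent.append)
```

The laptop only ever reads from a stream it opened and posts results back, so nothing has to listen for inbound connections. Swapping the body of `get_metrics` for a real data source leaves the loop untouched.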

And the fact that the implementation of this "open stream -> send message -> answer tool call on the spot" interaction was neatly contained in 12 lines was clearly demonstrated in the hands-on.

Open the stream, send the message, answer tool calls inline slide. Python code (12 lines) where you open a stream with client.beta.sessions.events.stream(), send user.message, and when agent.custom_tool_use arrives, process it with handle_tool() and send() the user.custom_tool_result.

"All of this is 12 lines" felt like the easiest way to express the "joy of managed services" in this session. If you tried to do it yourself, it would easily exceed 100 lines for connection maintenance, retries, error handling, state management, and so on.

Gut feeling 3: Choosing the word "oversight"

In the second half of the hands-on, Isabella shared this vision:

"I want you all to imagine the possibilities of where this can go if we give our agent more tools, more ability to take actions, access to your codebase, ability to put up PRs, ability to fix incidents, so that you as a human developer can just become the oversight and watch over the agents as they take action."
— Isabella He

My thoughts here: I felt significant meaning in her choice of the word "oversight." The point is that she expressed it as "oversight" rather than "automation" or "replacement." I interpreted it as showing a future where "an agent takes action while a human watches," rather than a future where an agent does whatever it wants on its own.

This might be similar to the debate over self-driving cars. Instead of jumping to full automation, I see it as a structure approaching "autopilot + air traffic controller," where humans oversee the whole space while machines take on individual operations. The agent moves its hands, and the human holds the major decisions and the overall direction.

In my own field, I have the impression that a design that positions the agent as a "powerful collaborator" rather than a "complete replacement" works better in operation. It feels realistic to gradually raise the line of "how much to automate" according to business risk — for example:

  • First, automate up to "cause identification + proposal," with humans doing the fixes.
  • Next, automate up to "creating fix patches," with humans doing the merges.
  • Some low-risk fixes (reverting configuration values, etc.) allow automatic merging.
  • Automate incident declaration, stakeholder notification, and post-mortem drafts.

Where to draw these lines will differ from business to business. That said, this may still be an area I cannot yet put into words well. It is easy to say "take on an oversight role," but the operational flow, UI, and approval criteria needed to support it will vary greatly between organizations. I expect to find the words for this theme through trial and error in the field.
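The staged line described above can be written down as a small policy check. This is only a sketch under my own assumptions: the `Autonomy` levels, the action names, and the `risk` labels are hypothetical and are not part of CMA. It covers the first three stages of the list and leaves out incident declaration and notification.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    PROPOSE_ONLY = 1         # agent identifies the cause and proposes; human fixes
    PATCH_NO_MERGE = 2       # agent writes the patch; human merges
    AUTO_MERGE_LOW_RISK = 3  # agent may merge low-risk fixes on its own


# Minimum autonomy level each action needs (hypothetical action names).
REQUIRED_LEVEL = {
    "propose_fix": Autonomy.PROPOSE_ONLY,
    "create_patch": Autonomy.PATCH_NO_MERGE,
    "merge_patch": Autonomy.AUTO_MERGE_LOW_RISK,
}


def needs_human(action, level, risk):
    """Return True when a human must perform or approve the action."""
    if action not in REQUIRED_LEVEL:
        return True  # unknown actions default to human oversight
    if level < REQUIRED_LEVEL[action]:
        return True
    # Even at the highest level, only low-risk merges skip approval.
    if action == "merge_patch" and risk != "low":
        return True
    return False


# Usage: the same action gets a different answer as the line is raised.
early = needs_human("create_patch", Autonomy.PROPOSE_ONLY, "low")
later = needs_human("create_patch", Autonomy.PATCH_NO_MERGE, "low")
risky_merge = needs_human("merge_patch", Autonomy.AUTO_MERGE_LOW_RISK, "high")
```

Keeping the line in one table makes "raising it gradually" a reviewable configuration change rather than a scattered code change.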

Gut feeling 4: The "delete data" function called session deletion

Though it is a modest feature, session deletion is one I personally felt "will be welcomed in the field."

"If you want to make sure that nothing is being retained for sessions that you don't want on the cloud or on your infrastructure, you can actually just come in and proactively manage how sessions are deleted, and once they're deleted they will be also removed from every single log aspect here."
— Isabella He

My thoughts here: Every vendor advertises that "you can retain data," but few services state outright that "you can delete data, and from every log aspect at that." In actual operation, being able to delete everything from A to Z is often the more important capability, and it carries real weight in enterprise adoption. Whether the issue is personal information, sensitive information, disclosure requests from users, or retention policies, there are areas where the conversation cannot move forward unless deletion is a given.
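On the operating side, "proactively managing" deletion usually comes down to a retention rule. The sketch below assumes session records are dictionaries with `id` and `created_at` fields and takes a `delete_session` callback. Both are hypothetical placeholders, and the actual CMA deletion call is not shown here.

```python
from datetime import datetime, timedelta, timezone


def expired_session_ids(sessions, now, max_age_days):
    """Return ids of sessions created before the retention cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return [s["id"] for s in sessions if s["created_at"] < cutoff]


def purge(sessions, now, max_age_days, delete_session):
    """Call delete_session(id) for each expired session; return deleted ids."""
    deleted = []
    for session_id in expired_session_ids(sessions, now, max_age_days):
        delete_session(session_id)  # hypothetical callback wrapping the real API
        deleted.append(session_id)
    return deleted


# Usage: a list's append method stands in for the real deletion call.
now = datetime(2026, 6, 1, tzinfo=timezone.utc)
sessions = [
    {"id": "s1", "created_at": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    {"id": "s2", "created_at": datetime(2026, 5, 25, tzinfo=timezone.utc)},
]
audit_log = []
deleted = purge(sessions, now, max_age_days=30, delete_session=audit_log.append)
```

Returning the deleted ids gives you something to record for compliance evidence, separate from the deletion itself.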

:::
💡 Chapter Summary: 4 gut feelings received from the hands-on

  • Prompts can be simple (push capabilities to the tool definition side)
  • Streaming is both UX and a part of observability
  • The division of roles between humans and agents is captured by the word "oversight"
  • "Being able to delete" is designed as a basic requirement for security and compliance
:::

6. The scenery beyond — Sub-agents, Memory, Outcomes, and Vaults

Here, I will organize the advanced feature sets that exist beyond the "minimal primitives" of CMA. Isabella briefly introduced these in the second half of the session.

In the session, nine advanced features (all labeled with Beta) were neatly lined up on a single slide.

Beyond the basics slide. Nine features — Subagents, Memory, Outcomes, Vaults, MCP servers, Webhooks, Permission policies, Interrupt, and Console agent builder — are listed with icons and short descriptions. All labeled with Beta.

  • Sub-agents (Multi-agent architecture) — A mechanism where an orchestrator agent spins up the context of another agent to delegate tasks. Since each sub-agent has its own context window, it is effective for both context management and parallel processing.
  • Memory + Dreaming (Memory and "Dreaming" service) — A mechanism where Claude reflects on its own memory logs and decides for itself "what it should remember." Only truly important information, such as user preferences or correction history, is retained as long-term memory.
  • Outcomes — A mechanism where an agent's task is defined not as a "collection of tool calls" but as a "desired result (rubric)." The agent decides for itself which tools to call and how to execute them to achieve the final result.
  • Vaults (Credential management) — A mechanism to securely manage credentials in an encrypted form on a per-user or per-session basis. A design where the "brain" never sees the actual credentials is fundamentally established.
  • Webhook / Fine-grained Permissions / MCP server controls / Console Agent Builder — Peripheral equipment necessary for production operations, such as external event-driven triggers, fine-grained permission policies, new MCP server controls, and the Console Agent Builder.

Among these, what particularly impressed me was the design philosophy of Outcomes.

"With outcomes you can define a rubric of exactly what you want the agent to produce and it'll figure out along the way which tool calls and what it needs to do to execute towards that final result."
— Isabella He

My thoughts here: In my mind, this approach feels like a shift from "commanding with code" to "contracting for results." It can be said that the thinking behind cloud services, which guarantee results through SLAs and SLOs, has been brought into the agent domain. My take is that the style of writing out every detailed step in a prompt will gradually fade away in the long run (this is just my current prediction, and the reality in the field might differ).
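To illustrate what "describing achievement criteria" could look like in practice, here is a small sketch of a rubric as data. The `Criterion` type, the result fields (`root_cause`, `evidence`, `proposed_fix`), and the `unmet` helper are my own assumptions and not the format CMA Outcomes actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[dict], bool]


# The rubric states what a finished result must contain, not how to get there.
RUBRIC = [
    Criterion("names a root cause", lambda r: bool(r.get("root_cause"))),
    Criterion("cites at least one metric", lambda r: len(r.get("evidence", [])) >= 1),
    Criterion("proposes a fix", lambda r: bool(r.get("proposed_fix"))),
]


def unmet(rubric, result):
    """Return the names of criteria the result does not yet satisfy."""
    return [c.name for c in rubric if not c.check(result)]


# Usage: a draft with a cause but no evidence or fix.
draft = {"root_cause": "bad deploy", "evidence": []}
missing = unmet(RUBRIC, draft)
```

Nothing in the rubric mentions tools or ordering. That absence is the shift from procedure to result.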

The combination of Vaults with Webhooks and fine-grained permissions also struck me, as an infrastructure engineer, as something whose presence or absence will greatly change the operational design. Anyone who has built a secret management foundation (Vault, Secrets Manager, KMS, IAM roles...) from scratch should intuitively appreciate Vaults. The decoupled design, in which the agent can access only the "results of using credentials" and not the "credentials themselves," is very close in principle to Zero Trust.
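The decoupling can be sketched in a few lines. This is a toy model under stated assumptions: `CredentialStore`, `execute_tool`, the `credential_ref` field, and the example URL are hypothetical, and a real vault would add encryption, scoping, and audit logging. The point is only where the secret is resolved.

```python
class CredentialStore:
    """Toy in-memory store; stands in for a real secret manager."""

    def __init__(self, secrets):
        self._secrets = dict(secrets)

    def resolve(self, ref):
        return self._secrets[ref]


def execute_tool(call, store, http_get):
    """Runs on the "hands" side: the reference becomes a real secret only here."""
    token = store.resolve(call["credential_ref"])
    body = http_get(call["url"], headers={"Authorization": f"Bearer {token}"})
    # Only the result goes back to the model, never the token.
    return {"tool_use_id": call["id"], "content": body}


# Usage: a fake HTTP function records what the "hands" actually sent.
captured = {}


def fake_http_get(url, headers):
    captured["headers"] = headers
    return '{"status": "ok"}'


store = CredentialStore({"metrics_api": "s3cr3t-token"})
call = {"id": "tu_1", "url": "https://example.invalid/metrics",
        "credential_ref": "metrics_api"}
result = execute_tool(call, store, fake_http_get)
```

The tool call the "brain" produces carries a name, and the returned result carries a response body. The token appears in neither.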

Building any one of these advanced features from scratch would be a "fairly large project" in itself. The fact that they are all included "in the box" carries significant strategic weight.

:::
💡 Chapter Summary: The direction of agent design indicated by advanced features

  • Sub-agents are a means to achieve "context splitting" and "parallel processing" simultaneously
  • Outcomes represent a paradigm shift from "describing procedures" to "describing achievement criteria"
  • Vaults are a Zero Trust-based decoupling design that prevents credentials from being visible to the "brain"
  • Features that would be massive projects to build from scratch are integrated into the harness

⚠️ On the other hand, these are not "magic that makes you smarter if you use them," but tools that "increase chaos if you don't design for them." If you invest in them, you must also estimate the design costs.
:::


7. What I want to try starting tomorrow

Having digested the video this far, there are many things I want to bring back to my work. Sub-agents, Outcomes, Vaults, Webhook-driven approaches — all look attractive.

However, if I try to test everything at once, it's clear that everything will end up half-baked. So, I will take the leap and focus on just one thing.

The one thing I want to try: Mapping existing agent implementations into "three layers"

Specifically, I will take the agent implementation I usually touch (or an agent project I'm designing with a client) and divide it into:

  • Agents layer (model, prompt, tool definitions = design of the "brain")
  • Environments layer (execution container, network boundary, credential boundary = where the "hands" move)
  • Sessions layer (event logs, state transitions, resume policy = history of interactions)

It is a small attempt to map these three on a single whiteboard.

This is not a flashy output, but I feel it makes the "boundaries of responsibility" visible: where do I hold things myself, and where can I rely on the managed service? Once I do this, questions like the following naturally emerge:

  • "Is the system truly capable of observing the sessions layer as event logs?"
  • "Does the credential management in the environments layer follow the principle of least privilege?"
  • "Are there any descriptions left in the agents layer prompt that should actually be pushed to the tools side?"
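The same inventory also works as a small data structure, which makes it easy to share and update. This is a sketch under my own assumptions: the `Item` fields, the `owner` values, and the example entries are hypothetical and should be replaced with whatever fits your field.

```python
from dataclasses import dataclass

LAYERS = ("agents", "environments", "sessions")


@dataclass
class Item:
    layer: str     # one of LAYERS
    name: str      # what is being inventoried
    owner: str     # "self" or "managed"
    covered: bool  # is this in place today?


def gaps_by_layer(items):
    """Group uncovered items by layer, keeping the layer order fixed."""
    gaps = {layer: [] for layer in LAYERS}
    for item in items:
        if item.layer not in gaps:
            raise ValueError(f"unknown layer: {item.layer}")
        if not item.covered:
            gaps[item.layer].append(item.name)
    return gaps


# Usage: a made-up inventory for illustration.
inventory = [
    Item("agents", "tool definitions", "self", True),
    Item("environments", "least-privilege credentials", "self", False),
    Item("sessions", "event log", "managed", False),
    Item("sessions", "resume policy", "self", False),
]
gaps = gaps_by_layer(inventory)
thinnest = max(LAYERS, key=lambda layer: len(gaps[layer]))
```

The layer with the most gaps is a reasonable first candidate for what to work on next, though business risk should weigh in as well.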

These are not big decisions like "switching to CMA" or "throwing away all proprietary harnesses" at once, but a simple inventory check that takes about an hour and can be started tomorrow. In the end, I feel that such grounded inventory checks change my field more than flashy decisions do.

Depending on the results of this inventory, it should become clear "what to try next." If observability is thin, establishing an event logging foundation comes first; if credential management is messy, it may lead to considering a mechanism equivalent to Vaults. I feel it is most natural to set priorities by working backwards from the current state of my own field.

However, to keep such an inventory check from ending as mere self-satisfaction for the person doing it, I think it is important to do it together with clients and team members. If I use the drawing as material for conversation, asking "where do you want to draw this boundary?" instead of keeping it inside my own head, the inventory check itself becomes organizational knowledge.

I don't think it's necessary to change everything at once, so I intend to start with this "three-layer mapping."

In fact, while writing this article, I drew it out on a whiteboard myself.

Whiteboard for inventorying agent implementation in "three layers." A three-layer structure of agents "brain" / environments "hands" / sessions "history", a diagram showing the Agent loop running server-side underneath, and notes on the right side titled "What becomes visible through this diagram" and "Questions that emerge from the inventory."

What I realized while drawing is that, in the agent implementations in my field, observation of the sessions layer is the thinnest part. Since I touch agents and environments on a daily basis, I had an internal model for them, but the perspective of treating sessions as an independent layer of "event logs + state transitions + resume policy" only became tangible after I mapped it out.

In an hour, anyone can probably draw a diagram that fits their own field. Before I could help readers build the same diagram in their minds, I needed my own first. That is what I felt.


Conclusion

At the end of the session, Isabella reviewed the content with participants using a "What we did today" recap slide. It closed on three points: (1) got the mental model, (2) shipped a working agent, and (3) know where to go next.

What we did today: Recap slide. Three checkboxes are aligned: "Got the mental model," "Shipped a working agent," and "Know where to go next."

Following this recap, if I put into words the biggest thing I was able to organize for myself, it is this: "a service has appeared that seriously tries to take over the 'agent operational foundation' layer that I had been carrying myself."

It's not about having to learn a mountain of new technical concepts. Rather, it was a session that reconfirmed that existing knowledge of distributed systems (event sourcing, control plane/data plane separation, Zero Trust, SRE practices) can be translated almost directly into the agent domain. I feel this is quite reassuring news as an infrastructure engineer and AI Solutions Architect.

And, raising the level of abstraction one step further and applying it to my own way of working, I was thinking about the following:

Isabella's sentence, "Harnesses should evolve alongside your agents," refers to the harness layer as a technical matter. However, when I apply this to my own mind, I feel it can also be interpreted as "my own internal organization (=my harness) must also be updated in line with the evolution of agents."

If you keep using the mental model of "how to use LLMs" that you built during a PoC for six months, it will drift away from reality. Models change, harnesses change, and client expectations change. That means my own mental organization must also be treated as something to update regularly.

Digesting conference sessions little by little in the evening is exactly this kind of "internal harness update." Carefully tracing each session reliably raises my resolution for the next one. Isabella's session, too, reorganized my own "agent production operation" drawer a little.

Finally, I will cite the sentence that hit me the most one more time.

"Harnesses should evolve alongside your agents."
— Isabella He

I will take this home not only as a story about "the agent's running gear," but as a message that "I should continue to evolve my own internal frameworks alongside the subjects I handle."

I hope this blog serves as a small help for organizing the thoughts of someone who is thinking every day, "how should I design agent production operations?"


Reference Information

  • Video title: Ship your first Managed Agent
  • Video URL: https://www.youtube.com/watch?v=19HDQ9HppOA
  • Speaker: Isabella He (Anthropic, Applied AI team, Member of Technical Staff)
  • Conference: Code with Claude (Held in London)
  • This article is a summary of my personal impressions after watching the above publicly available video. It does not represent the official views of Anthropic.
  • Numerical values mentioned in the video (10-15x faster to production, over 90% reduction in P95 TTFT, etc.) are observational values under specific internal Anthropic conditions. Please be aware that you may not get the same results in your own company environment.
  • Product specifications and feature names mentioned in this article (Bring Your Own Container / Compute, Claude MCP Tunnels, Dreaming, Outcomes, Vaults, etc.) are current as of the time the video was recorded. Please be sure to check the latest specifications in the official Anthropic documentation.
