Production AI agents have a well-known failure mode: they forget. An agent learns a workaround in one session and has no memory of it in the next, so the same job fails the same way, indefinitely. On May 6, Anthropic shipped three features for Claude Managed Agents (Dreaming, Outcomes, and Multiagent Orchestration) that collectively address what has been the largest gap in production agent infrastructure: the lack of agents that actually improve over time.
What Dreaming Actually Does
Dreaming is not a metaphor. It is a scheduled background process that reads an agent’s memory store alongside transcripts of its past sessions — up to 100 at a time — and produces a consolidated, reorganized memory store. Duplicates get merged. Contradictions get resolved. Patterns that appear across multiple sessions get surfaced as explicit knowledge.
The problem Dreaming solves is mundane but real: agents write to memory incrementally during sessions, and over time those stores accumulate noise — stale entries, conflicting facts, repeated observations. No single session can clean this up because no single session sees the full history. Dreaming runs asynchronously, outside any active session, and does the consolidation work that agents cannot do themselves.
Crucially, the input memory store is never modified. The output is a new store that can be reviewed, and discarded if it is not wanted, before it goes live. The Harvey legal AI team reported roughly a 6x increase in task completion rates after deploying Dreaming, primarily because their agents stopped forgetting the same filetype quirks and tool workarounds session after session.
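To make the consolidation idea concrete, here is a minimal local sketch. It is not Anthropic's implementation and does not call the Dreams API: the `MemoryEntry` shape, the `consolidate` function, and the "newest value wins" rule for contradictions are all assumptions chosen for illustration. What it does show is the contract described above: duplicates collapse to one entry, conflicts are resolved, facts seen in several sessions are surfaced, and the input store is left untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    """Hypothetical memory record; the real store format may differ."""
    key: str      # what the fact is about, e.g. "csv.delimiter"
    value: str    # the remembered fact
    session: int  # session that wrote it (higher means newer)


def consolidate(store, min_sessions=2):
    """Return (new_store, patterns) without mutating `store`.

    Assumed policy: for each key keep the newest value, and report a key as
    a pattern when its surviving value was observed in `min_sessions` or
    more distinct sessions.
    """
    latest = {}         # key -> newest entry seen so far
    sessions_seen = {}  # (key, value) -> set of sessions that recorded it
    for entry in store:
        current = latest.get(entry.key)
        if current is None or entry.session > current.session:
            latest[entry.key] = entry
        sessions_seen.setdefault((entry.key, entry.value), set()).add(entry.session)

    new_store = sorted(latest.values(), key=lambda e: e.key)
    patterns = sorted(
        key
        for (key, value), sessions in sessions_seen.items()
        if len(sessions) >= min_sessions and latest[key].value == value
    )
    return new_store, patterns


# Example input: one repeated observation and one contradiction.
store = [
    MemoryEntry("csv.delimiter", ";", 1),
    MemoryEntry("csv.delimiter", ";", 2),
    MemoryEntry("pdf.tool", "pdftotext", 1),
    MemoryEntry("pdf.tool", "ocr-fallback", 3),
]
new_store, patterns = consolidate(store)
```

Because `consolidate` returns a fresh list, a caller can inspect `new_store` and simply drop it if the result looks wrong, which mirrors the review-then-discard step described above.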
Dreaming is currently in research preview and supports claude-opus-4-7 and claude-sonnet-4-6. Full technical documentation is available at the Dreams API reference.
Outcomes: Agents That Check Their Own Work
Outcomes is a rubric-based self-evaluation loop. A developer writes a rubric — what success looks like for a given task — and attaches it to a session. When the agent produces output, a separate grader evaluates it against the rubric in its own isolated context window. If the output does not meet the rubric, the grader specifies what needs to change, and the agent revises. The loop runs up to a developer-configured maximum number of iterations.
The critical design detail is the isolation. The grader cannot see the agent’s reasoning history; it only sees the output and the rubric. This separation is what makes the feedback credible — the agent cannot produce a plausible-sounding self-justification and move on.
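The control flow of such a loop can be sketched locally. Everything below is hypothetical scaffolding rather than the Managed Agents implementation: `run_outcome_loop`, `Verdict`, the toy agent, and the word-limit grader are stand-ins. The point is the shape of the loop, in particular that the grader function receives only the output and the rubric, never the agent's internal state.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    feedback: str  # what needs to change when the rubric is not met


def run_outcome_loop(produce, grade, rubric, max_iterations):
    """Run produce/grade rounds until the rubric passes or the cap is hit.

    produce(feedback) -> output text
    grade(output, rubric) -> Verdict   (sees only the output and the rubric)
    Returns (final_output, iterations_used, passed).
    """
    feedback = ""
    output = ""
    for iteration in range(1, max_iterations + 1):
        output = produce(feedback)
        verdict = grade(output, rubric)
        if verdict.passed:
            return output, iteration, True
        feedback = verdict.feedback  # only the grader's notes flow back
    return output, max_iterations, False


def word_limit_grader(output, rubric):
    """Toy grader: checks a single rubric item, a maximum word count."""
    count = len(output.split())
    limit = rubric["max_words"]
    if count <= limit:
        return Verdict(True, "")
    return Verdict(False, f"Cut {count - limit} words.")


def toy_agent(feedback):
    """Toy agent: produces a long draft first, a shorter one after feedback."""
    return "one two three" if feedback else "one two three four five six"


result = run_outcome_loop(toy_agent, word_limit_grader, {"max_words": 4}, max_iterations=3)
```

In this toy run the first draft fails the rubric, the grader's feedback is passed back, and the second draft passes, so the loop stops well before the iteration cap.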
Setting up an outcome requires two API calls:
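import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# `agent` and `environment` are assumed to have been created earlier.
# Call 1: open a session for the agent.
session = client.beta.sessions.create(
    agent=agent.id,
    environment_id=environment.id,
    title="Financial analysis",
)

# Call 2: attach the outcome (rubric and iteration cap) to the session.
client.beta.sessions.events.send(session.id, events=[{
    "type": "user.define_outcome",
    "description": "Research brief",
    "rubric": "Every claim must be cited. Executive summary must be 150 words or fewer.",
    "max_iterations": 3,
}])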
Anthropic’s internal benchmarks show an 8.4% improvement in .docx quality and a 10.1% improvement in .pptx quality compared to standard prompting loops. Wisedocs, a medical document review company, cut review time by 50% while keeping output aligned with their internal quality standards. The Outcomes cookbook has a working citation-verification example.
Outcomes is in public beta with no waitlist. OpenAI has a similar concept with Codex’s /goal directive, but Outcomes uses an explicit rubric and a fully isolated grader — a more structured approach that trades flexibility for verifiability.
Multiagent Orchestration
When a job is too large or too varied for a single agent, Multiagent Orchestration lets a lead agent break the work into pieces and hand each piece to a specialist. Up to 20 subagents can run across up to 25 parallel threads. All agents share the same container filesystem; each runs in its own isolated context window, so one agent’s reasoning does not bleed into another’s.
Netflix’s platform team uses this to process build logs from hundreds of pipelines simultaneously — something that would be impractically slow as a sequential operation. The lead agent surfaces only the patterns worth acting on, rather than drowning in raw log volume.
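The fan-out and fan-in pattern behind this kind of log processing can be sketched with ordinary threads. This is an analogy under stated assumptions, not the Managed Agents API: `subagent_scan` and `lead_agent` are hypothetical plain functions standing in for agents, the log format is invented, and only the 25-thread limit is taken from the text above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_THREADS = 25  # parallel-thread limit quoted above


def subagent_scan(log_text):
    """Hypothetical specialist: return the error signatures found in one log.

    A signature here is the text before the first colon on an ERROR line.
    """
    return {
        line.split(":", 1)[0].strip()
        for line in log_text.splitlines()
        if line.startswith("ERROR")
    }


def lead_agent(logs, min_pipelines=2):
    """Fan logs out to specialists, then keep only recurring signatures."""
    workers = min(MAX_PARALLEL_THREADS, max(1, len(logs)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_log = list(pool.map(subagent_scan, logs))

    # Count in how many pipelines each signature appeared.
    counts = Counter(sig for sigs in per_log for sig in sigs)
    return sorted(sig for sig, n in counts.items() if n >= min_pipelines)


logs = [
    "INFO ok\nERROR timeout: step 3",
    "ERROR timeout: step 9\nERROR disk full: /tmp",
    "INFO ok",
]
recurring = lead_agent(logs)
```

Each specialist sees one log and returns a small summary; the lead only ever handles those summaries, which is how it avoids drowning in raw log volume.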
Full observability is available through the Claude Console: which agent did what, in which order, and why. For teams that cannot afford black-box agent behavior in production, this matters. The multiagent sessions documentation covers the full API.
What You Can Use Today
All Managed Agents features are available on the Claude Platform API with the managed-agents-2026-04-01 beta header, which the SDK sets automatically.
- Outcomes — Public beta, no access request required
- Multiagent Orchestration — Public beta, no access request required
- Memory — Public beta, no access request required
- Webhooks — Public beta, no access request required
- Dreaming — Research preview (limited access)
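For callers that bypass the SDK, the beta header has to be set by hand. The sketch below only builds the header dictionary and sends nothing. It assumes Managed Agents requests follow the same header conventions as the rest of the Claude API (`x-api-key`, `anthropic-version`, and a comma-separated `anthropic-beta`); `build_headers` is a hypothetical helper, and the beta value is the one quoted above.

```python
MANAGED_AGENTS_BETA = "managed-agents-2026-04-01"  # beta value quoted above


def build_headers(api_key, extra_betas=()):
    """Build request headers for a raw HTTP call (assumed conventions)."""
    betas = [MANAGED_AGENTS_BETA, *extra_betas]
    return {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "anthropic-beta": ",".join(betas),  # multiple betas are comma-separated
        "content-type": "application/json",
    }


headers = build_headers("sk-placeholder")  # placeholder key, not a real credential
```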
The official quickstart walks through agent creation, environment setup, and session streaming in under 50 lines of Python.
Why This Is Infrastructure, Not a Feature
Individual agent improvements — better prompting, tool use, longer context — change what an agent can do in a single session. Dreaming, Outcomes, and Multiagent Orchestration are different in kind: they address how agents behave across sessions, over time, at scale. That is the gap between a demo and a production system.
Agents that improve between runs, that verify their own outputs against a rubric, and that decompose complex jobs into parallel workstreams are not just more capable — they are fundamentally more trustworthy. For teams moving agents out of prototypes and into real systems, that is the distinction that matters.