Why multi-agent workflows fail in production
A survey of real coordination failures in Claude Code, Cursor, Devin, and the broader ecosystem — and what they reveal about agent architecture.
Multi-agent sounds like the obvious answer: parallelize work, specialize agents, go faster. And for demos, it works — you can show three agents collaborating on a feature and it looks impressive.
In production, the failures are consistent enough that Cognition (the team behind Devin) published a post titled "Don't Build Multi-Agents." The GitHub Blog ran "Multi-agent workflows often fail. Here's how to engineer ones that don't."
These aren't fringe complaints. They're structural.
Context doesn't travel
The foundational problem: each subagent starts fresh. The only information that passes between agents is the task prompt string. Everything the parent agent discovered — the codebase structure, constraints, decisions already made — has to be re-communicated explicitly or re-discovered from scratch.
The Claude Code docs acknowledge this directly:
"Subagents might miss the strategic goal or important constraints known to the parent agent, leading to solutions that are technically correct but not perfectly aligned with the user's original intent."
In practice this plays out as "context amnesia." One documented case: a user asked Claude Code to fix failing tests and it repeatedly spawned subagents for work that could have been done in the main context — burning through tokens with no benefit because each subagent re-explored files the parent already understood. GitHub issue #11712 captures a related failure: when agents are resumed, they lose the user prompt that initiated the resumption, so the resumed agent lacks the context that explains why it exists.
The community workaround is "Main Agent as Project Manager with State Awareness": the parent agent maintains a shared context document and explicitly passes relevant state to each subagent's prompt. This works, but it's manual prompt engineering — the developer is doing the coordination work that the system should handle.
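The pattern can be sketched in a few lines. This is a minimal illustration of the workaround, not any tool's actual API — the function name, context fields, and the PR number in the example are all hypothetical:

```python
def build_subagent_prompt(task: str, shared_context: dict) -> str:
    # The prompt string is the only channel between agents, so the
    # parent serializes its state (goal, constraints, decisions made)
    # into a context document and prepends it to every subagent task.
    context_doc = "\n".join(f"- {key}: {value}" for key, value in shared_context.items())
    return (
        "## Shared context (maintained by the parent agent)\n"
        f"{context_doc}\n\n"
        "## Your task\n"
        f"{task}"
    )

# Hypothetical usage: the parent passes down what it already knows.
prompt = build_subagent_prompt(
    "Fix the failing tests in tests/test_auth.py",
    {
        "goal": "All CI checks green before release",
        "constraint": "Do not change public API signatures",
        "decision": "The auth refactor branch is the source of truth",
    },
)
```

The cost of this pattern is exactly what the text describes: the developer must decide, per subagent, which state is "relevant" — a judgment the system itself never makes.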
Parallel agents conflict
When agents run in parallel, they make independent decisions about shared state. Cognition's analysis makes the problem concrete:
"If a task is 'build a Flappy Bird clone' divided into subtasks, one subagent might build a Super Mario Bros. background while another builds an incompatible bird, leaving the final agent to combine these miscommunications."
The GitHub Blog identifies the systemic version of this:
"Agents may close issues that other agents just opened, or ship changes that fail downstream checks they didn't know existed, because agents make implicit assumptions about state, ordering, and validation without explicit instructions."
The failure mode compounds. From Towards Data Science:
"When one agent decides something incorrectly, downstream agents assume it's true, and by discovery time, 10 downstream decisions are built on that error."
This is why Devin avoids parallel agents entirely. It's not a capability limitation — it's an architectural choice based on the failure modes.
Cost and latency explode
Multi-agent token consumption doesn't scale linearly. The GitHub Blog documents the production gap:
- 3-agent workflows that cost $5–50 in demos reach $18,000–90,000/month at scale
- Response times jump from 1–3 seconds to 10–40 seconds per request
- Reliability drops from 95–98% in pilots to 80–87% under production load
The underlying cause: every inter-agent handoff requires token-intensive context reconstruction. The parent encodes its state into a prompt; the subagent re-processes the entire relevant context from scratch. Multiplied across many agents and many calls, the token budget explodes.
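Back-of-the-envelope arithmetic shows why this is superlinear. The numbers below are illustrative assumptions (not measurements from the GitHub Blog), and the all-pairs handoff topology is one possible worst case:

```python
# Illustrative cost model. All constants are assumptions for the sketch.
CONTEXT_TOKENS = 8_000   # context re-sent per inter-agent handoff
WORK_TOKENS = 2_000      # net new tokens per agent's own work
PRICE_PER_MTOK = 3.00    # dollars per million input tokens

def request_cost(n_agents: int) -> float:
    # Assumed topology: each agent hands context to each peer, so the
    # handoff count grows quadratically while work grows linearly.
    handoffs = n_agents * (n_agents - 1)
    tokens = n_agents * WORK_TOKENS + handoffs * CONTEXT_TOKENS
    return tokens / 1_000_000 * PRICE_PER_MTOK

single = request_cost(1)   # no handoffs: just the agent's own work
multi = request_cost(3)    # 6 handoffs dominate the bill
print(f"1 agent: ${single:.3f}/request, 3 agents: ${multi:.3f}/request")
```

Under these assumptions the three-agent request costs far more than three times the single-agent one, and at production request volumes the gap compounds into the monthly figures above.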
Cursor's background agents add a different dimension: cloud environment reliability.
User-reported failures include Docker builds failing during apt-get update, git branch push failures, connection dropouts that stall agents mid-task, and cloud environment initialization errors. The compute is remote and shared, so failures that don't exist locally appear at scale.
Where each system struggles
(Chart: "Multi-Agent Coordination: Where Each System Struggles")
The chart reflects the research above. Claude Code is strong on environment reliability (local execution) but has no mechanism for context continuity or parallel conflict handling. Cursor partially addresses parallelism through Git worktrees but has the opposite reliability profile — cloud execution introduces environment failures. Devin avoids parallel agents entirely and invests heavily in error recovery through its review agent, which is why it scores high on those axes but zero on parallel conflict handling.
No system in the current survey scores well across all five dimensions. Context continuity is the universal weak spot.
Why better models don't fix this
The 2026 AI Agent Report is direct:
"Most multi-agent failures aren't caused by weak models — they're caused by weak reasoning architecture. Orchestrating multiple agents with divergent goals, conflicting information, and cascading failures requires architectural discipline."
Code quality compounds the issue. A January 2026 Stack Overflow Blog analysis found that AI-generated code includes bugs at 1.5–2x the rate of human-written code when supervision gaps exist, with 3x the readability issues. Multi-agent workflows create supervision gaps by design — no single reviewer sees the whole picture.
The integration layer is where failures originate: how agents hand off state, coordinate writes, report progress, and signal when they're stuck. Models are getting better; orchestration architecture largely isn't.
What the research says works
The GitHub Blog identifies several patterns that prevent the most common failures:
Typed schemas for inter-agent messages. Without explicit contracts between agents, every handoff is a natural language interpretation problem. Typed schemas eliminate a class of coordination errors before they happen.
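A minimal sketch of what such a contract can look like, using a plain dataclass; the field names are assumptions for illustration, not a schema from any specific framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandoffMessage:
    """Typed contract for an inter-agent handoff (illustrative schema)."""
    task_id: str
    instruction: str
    constraints: tuple[str, ...] = ()   # rules the worker must not break
    artifacts: tuple[str, ...] = ()     # file paths the worker may touch

    def __post_init__(self):
        # Reject an empty handoff at construction time, instead of
        # letting the worker interpret an ambiguous blob of prose.
        if not self.instruction.strip():
            raise ValueError("handoff requires a concrete instruction")
```

The point is less the dataclass than the failure it moves: a malformed handoff now fails loudly at the boundary, rather than silently downstream as a misinterpreted instruction.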
Explicit handoff contracts. The orchestrator maintains state; workers are stateless and only know what the orchestrator tells them per-invocation. This is the "Main Agent as Project Manager" pattern formalized. It's more overhead to design but dramatically reduces inter-agent confusion.
Budget meters and permission gates. Catching runaway token consumption before it becomes a $90,000 surprise requires active monitoring. Permission gates before destructive or expensive operations give the system a chance to pause.
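The mechanics are simple to sketch. This is a toy meter, not any production system's API; in a real system the gate would prompt a human instead of denying by default:

```python
class BudgetMeter:
    """Track token spend and gate risky operations (illustrative sketch)."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        # Fail fast when the run exceeds its budget, instead of
        # discovering the overage on the monthly invoice.
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise RuntimeError(
                f"token budget exceeded: {self.spent}/{self.max_tokens}"
            )

    def gate(self, operation: str, destructive: bool) -> bool:
        # Permission gate: pause before destructive or expensive ops.
        # Here we simply deny them; a real gate would ask the user.
        return not destructive

meter = BudgetMeter(max_tokens=50_000)
meter.charge(12_000)                                   # within budget
assert meter.gate("read file", destructive=False)      # allowed
assert not meter.gate("force-push branch", destructive=True)  # blocked
```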
Observable task state. When agents can report their current status to a shared registry — not just to their own context — the orchestrator and user can see what's happening and intervene. This is the problem the task registry design addresses.
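The shape of the idea, as a minimal sketch — walrus's actual registry design isn't reproduced here, and all names are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    STUCK = "stuck"
    DONE = "done"

@dataclass
class TaskRegistry:
    """Shared, observable task state (illustrative sketch)."""
    tasks: dict = field(default_factory=dict)

    def report(self, agent: str, task: str, status: Status) -> None:
        # Agents write status here, not only into their own context,
        # so the orchestrator and user can see progress and intervene.
        self.tasks[(agent, task)] = status

    def stuck(self) -> list:
        # The orchestrator polls for agents that have signaled trouble.
        return [key for key, s in self.tasks.items() if s is Status.STUCK]

registry = TaskRegistry()
registry.report("worker-1", "fix tests", Status.RUNNING)
registry.report("worker-2", "update docs", Status.STUCK)
print(registry.stuck())   # [('worker-2', 'update docs')]
```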
Checkpointing over re-discovery. Explicit handoff documents (a structured summary of what's been done, what constraints apply, what decisions have been made) reduce context amnesia. The cost of writing a handoff document is cheaper than the cost of a subagent re-exploring the same territory.
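A handoff document can be as simple as a serializable record. The field names below are illustrative assumptions; the point is that the structure round-trips losslessly between agents:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class HandoffDocument:
    """Structured checkpoint a finishing agent writes so the next
    agent doesn't re-explore the same territory (illustrative)."""
    done: list          # work already completed
    constraints: list   # rules that still apply
    decisions: list     # conclusions the next agent should trust

    def dump(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def load(cls, raw: str) -> "HandoffDocument":
        return cls(**json.loads(raw))

doc = HandoffDocument(
    done=["mapped auth module", "reproduced failing test"],
    constraints=["no public API changes"],
    decisions=["root cause is a stale session cache"],
)
restored = HandoffDocument.load(doc.dump())
assert restored == doc   # the checkpoint survives the handoff intact
```

Writing a few hundred tokens of checkpoint is cheap; a subagent re-reading the same files to rediscover the same facts is not.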
Further reading
- Don't Build Multi-Agents — Cognition's case for single-agent architecture
- Multi-agent workflows often fail. Here's how to engineer ones that don't. — GitHub Blog's structural analysis
- Why Your Multi-Agent System is Failing: Escaping the 17x Error Trap — cascading decision error analysis
- How I Solved Context Amnesia in Claude Code — community workaround for context continuity
- Seeing what your agents are doing: the task registry problem — how walrus addresses observable task state
- Plans vs tasks: how AI agents think before they act — the planning side of multi-agent coordination