Plans vs tasks: how AI agents think before they act
A survey of plan-then-execute patterns across Claude Code, Cursor, Devin, Windsurf, and Copilot — and what it means for autonomous agent design.
Every AI agent faces the same problem: given an open-ended goal, how do you avoid charging ahead in the wrong direction?
The answer most production systems have converged on: separate planning from execution. Analyze first, act second. Make the plan visible and editable before committing to it. This turns out to be more than a UX nicety — it's the difference between an agent that's useful on complex tasks and one that confidently does the wrong thing.
We surveyed how five major coding agents implement this separation — Claude Code, Cursor, Devin, Windsurf, and GitHub Copilot — and what the emerging patterns mean for how walrus should think about plans and tasks.
Why planning matters
The naive agent loop is: receive a goal → take action → repeat until done. This works for simple tasks. It fails badly on anything requiring multi-step coordination — the agent makes irreversible edits early, paints itself into corners, or misunderstands scope and rewrites the wrong thing.
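The naive loop is easy to state precisely. A minimal sketch (the function names and the scripted stand-in for a model are illustrative, not any real system's API):

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)

def naive_agent(goal, decide, tools, max_steps=10):
    """React-style loop with no planning phase: decide, act, repeat.
    Every action is applied immediately -- including irreversible ones."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = decide(history)
        if action.name == "done":
            break
        result = tools[action.name](**action.args)  # no review gate before acting
        history.append((action.name, result))
    return history

# Scripted stand-in for a model: it edits first and never asks questions.
script = iter([Action("edit", {"path": "app.py"}), Action("done")])
tools = {"edit": lambda path: f"edited {path}"}
history = naive_agent("fix the bug", lambda h: next(script), tools)
```

The failure mode the paragraph describes is visible in the structure: the `edit` call lands before any analysis, and nothing in the loop can undo it.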
Research on SWE-bench shows this concretely. Refact.ai's top-ranked approach includes an explicit deep_analysis() reasoning step before applying changes. Their workflow:
- Describe the problem
- Investigate the repo
- Create and run a problem reproduction script
- Make a plan, then apply changes
- Run tests, evaluate, repeat
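The steps above amount to an outer loop around an explicit analysis phase. A sketch under stated assumptions (the phase names and the toy stand-ins below are illustrative; Refact.ai's actual implementation is not public in this form):

```python
def solve_issue(issue, steps, max_rounds=3):
    """Plan-then-apply loop in the Refact.ai style (illustrative, not their API).
    `steps` supplies toy implementations of each phase so the sketch runs."""
    context = steps["investigate"](issue)
    repro = steps["reproduce"](issue, context)
    for round_num in range(1, max_rounds + 1):
        plan = steps["analyze"](issue, context)  # explicit reasoning before edits
        steps["apply"](plan)
        if steps["test"](repro):                 # evaluate, then repeat if needed
            return plan, round_num
    return None, max_rounds

# Toy phases: the fix "lands" on the second round.
state = {"applied": 0}
steps = {
    "investigate": lambda issue: {"files": ["bug.py"]},
    "reproduce":   lambda issue, ctx: "repro.sh",
    "analyze":     lambda issue, ctx: f"patch attempt {state['applied'] + 1}",
    "apply":       lambda plan: state.update(applied=state["applied"] + 1),
    "test":        lambda repro: state["applied"] >= 2,
}
plan, rounds = solve_issue("off-by-one in parser", steps)
```

The point of the structure is that analysis output (the plan) is a distinct artifact produced before `apply` runs, and test results feed the next round.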
The planning step isn't decorative — it's how they hit 74.4% on SWE-bench Verified. Interestingly, they found that removing a separate strategic_planning() tool powered by o3 actually improved results once they upgraded to Claude 4 Sonnet: the frontier model handles planning as part of its reasoning rather than as a separate explicit step.
This points to something important: planning doesn't always need to be a separate mode. It needs to happen, but where it lives in the architecture varies.
How five systems handle planning
[Radar chart: Planning Capability Coverage by System]
Claude Code: plan mode + TodoWrite
Claude Code has the most explicit plan-execute separation of any system we surveyed. It ships two mechanisms:
Plan mode (activated with /plan or Shift+Tab twice) is a read-only operating phase where Claude can only observe, analyze, and write to a plan file — no edits, no shell commands. The plan is written to a markdown file in ~/.claude/plans/. The user can open it with Ctrl+G, edit it, remove steps they don't want, and then approve. Claude exits plan mode and implements exactly what was agreed.
What's notable about this design: Claude Code's creator Boris Cherny uses it himself — start in plan mode, iterate until the plan is right, then switch to auto-accept for execution. The plan mode is fast: since Claude isn't running tools or writing files, responses are much quicker and cheaper.
TodoWrite is the execution-side complement. During implementation, Claude maintains a structured task list — pending, in-progress, completed. It marks tasks done immediately as they finish, with exactly one task in-progress at a time. The todo list is visible to the user throughout execution, providing a live view of what's happening and what's left.
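The invariants described here — tasks marked done the moment they finish, exactly one in progress at a time — can be captured in a few lines. A minimal sketch (not Claude Code's actual TodoWrite tool, whose state lives in the prompt rather than in runtime code):

```python
class TodoList:
    """TodoWrite-style tracker: at most one task in progress at a time,
    tasks marked completed as soon as they finish."""
    def __init__(self, tasks):
        self.status = {t: "pending" for t in tasks}

    def start(self, task):
        assert "in-progress" not in self.status.values(), "one task at a time"
        self.status[task] = "in-progress"

    def finish(self, task):
        assert self.status[task] == "in-progress"
        self.status[task] = "completed"

    def render(self):
        """Live view shown to the user during execution."""
        marks = {"pending": " ", "in-progress": ">", "completed": "x"}
        return [f"[{marks[s]}] {t}" for t, s in self.status.items()]

todos = TodoList(["read config", "patch parser", "run tests"])
todos.start("read config"); todos.finish("read config")
todos.start("patch parser")
```

The `render` output is the part that matters to the user: a stable, glanceable answer to "what is the agent doing right now?"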
The two mechanisms serve different phases: plan mode governs what happens before execution is approved; TodoWrite tracks what happens during it.
Plan mode also has a subagent model — specialized agents (Plan, Explore, Task) that can be launched inside a session. The Plan agent is constrained to research tools only. The Task agent can use all tools. This mirrors the plan-execute split at the agent level, not just the session level.
Cursor: plan mode + background agents + automations
Cursor's architecture has evolved toward parallel, autonomous execution with planning as a first step.
Agent plan mode lets the AI write a detailed Markdown plan before touching any code. PMs and engineers can review, edit inline, or store plans as reusable templates. The workflow: describe the task → agent produces a plan → user approves step-by-step → execution.
Background agents take this further. You can push an agent run to the background while you keep coding — the agent works asynchronously, notifies you on completion or when it needs approval. Multiple agents can run in parallel on different tasks. Linear integration lets you start agent runs directly from issue workflows.
Automations (announced March 2026) go further still: agents triggered by events — a new commit, a Slack message, a PagerDuty incident, a timer. Cursor estimates it runs hundreds of automations per hour. An incident arrives in PagerDuty, an agent queries server logs via MCP, investigates, proposes a fix.
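Event-driven dispatch of this kind is a small routing layer over the agent runtime. A sketch of the shape (the event names, handler signature, and decorator API are assumptions for illustration, not Cursor's actual interface):

```python
from collections import defaultdict

class Automations:
    """Route external events (commits, alerts, timers) to agent runs."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def on(self, event_type):
        def register(fn):
            self.handlers[event_type].append(fn)
            return fn
        return register

    def dispatch(self, event_type, payload):
        # In a real system each handler would launch an asynchronous agent run.
        return [fn(payload) for fn in self.handlers[event_type]]

auto = Automations()

@auto.on("pagerduty.incident")
def investigate(incident):
    # A real handler would start an agent that queries logs via MCP.
    return f"agent investigating: {incident['title']}"

results = auto.dispatch("pagerduty.incident", {"title": "5xx spike on api-gateway"})
```

The design choice worth noting: the trigger layer knows nothing about planning. The plan/approval checkpoint happens inside the agent run, after dispatch.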
The pattern: planning is the human checkpoint before autonomous execution. After approval, the agent runs without intervention until it needs another decision.
Devin: upfront planning with continuous revision
Devin's approach is the most human-workflow-aligned. When you provide a task, Devin:
- Inspects the repository
- Returns a step-by-step plan in seconds
- Waits for you to modify it before proceeding
The Devin 2.0 architecture makes plan revision central — "the plan changes a lot over time." This isn't a failure mode, it's the design. As Devin investigates, discovers constraints, and runs into dead ends, it updates the plan. The user can see and redirect at any point.
Devin also runs a separate review agent that pressure-tests the implementation after the writing agent finishes. One agent writes, another critiques. The review agent can trigger another round of fixes — a closed loop that doesn't require user input unless it gets stuck.
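The write/critique loop has a simple skeleton. A sketch with toy agents standing in for Devin's (the function shapes are illustrative; Cognition has not published this interface):

```python
def write_review_loop(task, writer, reviewer, max_rounds=3):
    """One agent writes, another critiques; reviewer feedback triggers
    another round until it approves or rounds run out."""
    draft = writer(task, feedback=None)
    for _ in range(max_rounds):
        verdict = reviewer(draft)
        if verdict == "approve":
            return draft
        draft = writer(task, feedback=verdict)  # revise against the critique
    return draft  # a real system would escalate to the user here

# Toy agents: the reviewer rejects until the draft handles the edge case.
writer = lambda task, feedback: f"fix for {task}" + (" + edge case" if feedback else "")
reviewer = lambda draft: "approve" if "edge case" in draft else "missing edge case"
final = write_review_loop("issue #42", writer, reviewer)
```

The closed-loop property comes from the `max_rounds` bound: the loop runs without user input until it either converges or exhausts its budget.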
Windsurf: three modes with megaplan
Windsurf's Cascade has three distinct modes: Ask (conversation), Code (execution), and Plan (planning only).
Plan mode produces a structured implementation plan before any code is written. The megaplan command triggers an advanced variant that asks clarifying questions before generating a more comprehensive plan — useful for large, ambiguous tasks where the agent needs to reduce uncertainty before proposing an approach.
Wave 13 added parallel multi-agent sessions with Git worktrees and side-by-side Cascade panes. Multiple plans can execute simultaneously in isolated branches.
GitHub Copilot Workspace: plan as the entry point
GitHub Copilot Workspace makes planning the primary interface. You don't start by describing code changes — you start with an issue or goal, and Copilot generates a plan: which files to touch, what to change, why. You edit the plan directly before any code is generated.
The plan is the artifact. Code generation is downstream of it.
This is the most explicit "plan is a user-editable document" design in the survey — but reviews note that Copilot's planning remains shallower than dedicated agent systems: it sometimes abandons plans mid-execution or generates plans that don't reflect the actual implementation complexity.
Patterns across systems
The radar above shows capability coverage. This chart shows when in the workflow each system allows planning to happen — pre-execution only, mid-execution, or post-write:
[Chart: Where Each System Places the Plan/Execute Boundary]
Five patterns appear consistently across all systems:
1. Plan before executing, not during. Every system separates the analysis phase from the action phase. The plan is generated, reviewed, and approved before any files are touched. This isn't just a UX pattern — it reduces irreversible errors and aligns the agent's understanding with the user's intent before the costly part starts.
2. Plans are visible and editable. Opaque planning that the user can't inspect or modify produces anxiety and distrust. Every system that succeeded with developers (Devin, Claude Code, Cursor) makes the plan an artifact you can read and modify. The agent is a collaborator proposing a plan, not a black box executing one.
3. Task tracking during execution. Plans decompose into tasks. Tasks are tracked with status (pending / in-progress / done). The user can see where execution is at any moment. This matters for long tasks — without it, the agent feels like a black box even when it's working correctly.
4. Approval gates. Users approve the plan before execution begins. Some systems (Devin) also checkpoint at ambiguous decision points during execution. The key insight: approval gates are not friction — they're the mechanism that makes autonomous execution feel safe enough to allow.
5. Plan revision as a feature, not a failure. Devin's explicit position that "the plan changes a lot over time" reflects a mature understanding of software tasks. Plans made with incomplete information need to evolve. Systems that treat the initial plan as fixed become brittle.
The academic framing
This pattern has a name in the research literature: plan-then-execute agents, sometimes called HTN (Hierarchical Task Network) planners applied to LLMs.
LangChain's plan-and-execute agent design formalizes this for harder tasks: a planner LLM generates a full task list, an executor LLM works through each task, and the planner can revise based on execution feedback. The separation of planner and executor allows each to be tuned independently — the planner optimized for decomposition quality, the executor for reliable task completion.
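The planner/executor separation described above can be sketched as a small loop where execution failures flow back into replanning (this is the pattern, not LangChain's actual classes or API):

```python
def plan_and_execute(goal, planner, executor, max_revisions=2):
    """Plan-and-execute skeleton: a planner emits a task list, an executor
    works through it, and failed tasks are fed back for replanning."""
    plan = planner(goal, failures=[])
    for _ in range(max_revisions + 1):
        failures = [t for t in plan if not executor(t)]
        if not failures:
            return plan
        plan = planner(goal, failures=failures)  # revise only what broke
    raise RuntimeError("plan could not be completed")

# Toy planner/executor: the first attempt at the test step fails once.
planner = lambda goal, failures: (
    ["locate bug", "write fix", "add test"] if not failures
    else [f"retry: {t}" for t in failures])
executor = lambda task: task != "add test"
plan = plan_and_execute("fix flaky parser", planner, executor)
```

Because the two roles are separate functions, each can be tuned independently — exactly the property the paragraph describes.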
Recent work on SWE-bench Pro (long-horizon software engineering tasks) shows that planning quality is the primary bottleneck for agents on complex multi-session tasks — not execution ability. Agents that can generate accurate plans for multi-day tasks dramatically outperform reactive agents on the same tasks.
The flip side: Refact.ai's SWE-bench findings show that for well-scoped single-issue tasks, frontier models can internalize planning as part of reasoning — a separate strategic_planning() step adds latency without adding quality. The right architecture depends on task horizon and ambiguity.
The hidden truth: plan mode is a prompt
Before drawing conclusions for walrus, there's a finding worth surfacing directly. Armin Ronacher reverse-engineered Claude Code's plan mode and found:
"It is in fact just a rather short predefined prompt that enters plan mode. The tool to enter or exit plan mode is always available."
There is no runtime enforcement. No tool restrictions. No locked-down execution context. Claude Code's plan mode is a system prompt injection that says "do not execute yet" — and the model follows it because it's instructed to, not because the runtime enforces it.
This is confirmed by a GitHub issue requesting skill-based plan mode customization: users discovered that planning behavior can be fully replicated with a slash command that injects the right prompt. The magic is linguistic, not architectural.
The same is true for TodoWrite. The model marks tasks done because it's instructed to follow that convention — not because the runtime tracks task state.
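Because the mechanism is a prompt, it can be reproduced in a few lines. A sketch of injecting a plan-mode instruction into a conversation (the wording of the prompt and the message shape are illustrative, not Claude Code's actual text):

```python
PLAN_MODE_PROMPT = (
    "You are in plan mode. Do not edit files or run commands. "
    "Only research the codebase and write a step-by-step plan. "
    "When the user approves the plan, you may exit plan mode."
)

def with_plan_mode(messages, enabled):
    """Prepend the plan-mode instruction as a system message.
    Enforcement is purely linguistic: the model obeys because it is told to,
    not because any runtime blocks its tools."""
    if not enabled:
        return list(messages)
    return [{"role": "system", "content": PLAN_MODE_PROMPT}] + list(messages)

msgs = with_plan_mode([{"role": "user", "content": "refactor the auth module"}], True)
```

This is the whole trick: no tool gating, no sandboxing — a read-only phase built entirely from instructions.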
What this means for walrus
This finding reshapes the architecture question. Planning behavior doesn't need runtime primitives — it needs good skills.
Plans are prompts. They belong in skills.
A planning skill encodes: write a plan file before acting on complex tasks, ask for approval before executing destructive changes, update the plan as you learn more. This is pure behavioral instruction — the same thing Claude Code does, but as a shareable, community-maintained skill rather than a baked-in mode. Any walrus user can install it, modify it, or replace it with their own variant.
This fits less code, more skills exactly. The planning behavior that every team has different opinions about — how verbose the plan should be, when to ask for approval, how to format task lists — is precisely the kind of thing that doesn't belong hardcoded in a runtime.
The open question is observability.
Plan mode being a prompt settles the planning side. But it surfaces a harder problem: when an agent dispatches a subagent, neither the parent agent nor the user has visibility into what the subagent is doing. Claude Code emits SubagentStart/SubagentStop hook events — lifecycle signals only. There is no structured "what is this agent working on right now" signal. The feature request for an agent hierarchy dashboard is open and unanswered.
That's the problem worth solving at the runtime level — not plan mode. We'll cover the design in a follow-up post.
Further reading
- What is Claude Code's plan mode? — Armin Ronacher's analysis of how plan mode actually works
- Cursor Automations — TechCrunch on Cursor's event-driven agent system
- Devin 2.0 — Cognition's plan-revise loop
- Refact.ai SWE-bench — when explicit planning helps vs. when it doesn't
- Less code, more skills — walrus's design principle
- How AI agents remember — our memory survey