Seeing what your agents are doing: the task registry problem
Why agent observability needs a runtime primitive, not a prompt — and how walrus solves it with a live task registry.
The previous post settled the planning question: plan mode is a prompt, not a runtime primitive. Plans belong in skills.
That left one problem open:
"When an agent dispatches a subagent, neither the parent agent nor the user has visibility into what the subagent is doing."
This post covers that problem and walrus's solution.
The problem
Modern coding agents regularly spawn multiple subagents — a research agent, a planning agent, several parallel implementation agents. Each runs its own context window, makes its own tool calls, and maintains its own state.
From the outside, they're black boxes.
Claude Code emits two hook events: SubagentStart and SubagentStop.
That's lifecycle signaling — you know a subagent started and stopped, but
not what it's working on, whether it's stuck, or how much context it's consumed.
GitHub issue #24537 — a request for an agent hierarchy dashboard — has been open and unanswered. The issue captures the real scenario precisely:
"The Claude Code conversation view was designed for a human talking to one agent. It was never meant to be a control plane for 7 concurrent subagents across multiple sessions."
A concrete example: you request a feature implementation across five files. Claude spawns three parallel subagents — Explore, Plan, Bash — each making 30+ tool calls over ten minutes. In the current interface, you see interleaved tool summaries with no way to answer: "Which agent is stuck?" or "How much context has each consumed?"
How current systems handle it
Agent Observability: Feature Coverage by System

| System | Live task state | Cross-agent visibility | Built in? |
|---|---|---|---|
| Claude Code | Partial (TodoWrite, single-agent scope) | Lifecycle hooks only (SubagentStart/SubagentStop) | No; community workarounds |
| Devin | Yes (live progress list, mid-task redirect) | N/A; single-agent architecture | Yes |
| Cursor | Completion notifications only | No unified view of running agents | Partial (worktree isolation) |
| OpenTelemetry GenAI | Spans and traces for infrastructure | Not user-facing | Emerging standard |
Claude Code gets partial credit for live task tracking — the TodoWrite
convention lets an agent report what it's doing, but only within a single
agent context. A parent agent can't read a subagent's todo list. The hook
events (SubagentStart/SubagentStop) are lifecycle signals, not task
state. Community workarounds exist (one project pipes hook events to an HTTP server, stores them in SQLite, and streams them over a WebSocket to a browser dashboard), but nothing is built in.
Devin has good task visibility — a live progress list with the current step, the ability to redirect mid-task, and an approval gate before execution starts. But Devin is architecturally single-agent: the planning and implementation happen in the same agent context. There's no subagent hierarchy to expose.
Cursor runs background agents in Git worktrees with notifications on completion. Worktree isolation is clean. What's missing: a unified view of what each running agent is working on. You know agents are running; you don't know what they're doing.
OpenTelemetry's GenAI SIG is standardizing spans and traces for AI agent frameworks — useful for infrastructure teams, not helpful for a user asking "which subagent is stuck on this auth change."
Why "just use a prompt" doesn't work
Plan mode works as a prompt because it's intra-agent behavioral guidance: Claude is told "don't execute yet" and follows the instruction. The instruction and the agent it affects are the same entity.
Cross-agent visibility is different. A subagent can be instructed to call
update_task("researching auth flow"). But who reads that call? The parent
agent isn't watching the subagent's context window. There's no shared
channel unless the runtime provides one.
This is the core asymmetry:
- Prompts are intra-agent — they shape the behavior of the agent receiving them
- Registries are inter-agent — they create a shared channel that parent agents and users can observe
You can't prompt your way to cross-agent observability. The runtime has to provide the channel.
The walrus task registry
Walrus maintains an in-memory task registry as a first-class runtime primitive — a concurrent hash map that lives in the walrus process. Each entry records the agent's id, its parent, current status, a plain-English summary, and timestamps.
The agent API is a single call: update_task(id, status, summary). An
agent calls this when it starts work, when it completes a step, and when
it hands off to a subagent. The registry records the parent-child
relationship automatically from the call context.
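To make the shape of the registry concrete, here is a minimal Python sketch. It is illustrative only: walrus's actual implementation language and names are not specified here, so `TaskEntry`, `TaskRegistry`, and `snapshot` are assumed names, and the parent-child link is passed explicitly where walrus derives it from the call context.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskEntry:
    agent_id: str
    parent_id: Optional[str]  # walrus derives this from the call context
    status: str               # e.g. "running", "done"
    summary: str              # plain-English description of current work
    started_at: float = field(default_factory=time.monotonic)
    updated_at: float = field(default_factory=time.monotonic)

class TaskRegistry:
    """Session-scoped, in-memory registry: one entry per agent."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tasks: dict[str, TaskEntry] = {}

    def update_task(self, agent_id: str, status: str, summary: str,
                    parent_id: Optional[str] = None) -> None:
        with self._lock:
            entry = self._tasks.get(agent_id)
            if entry is None:
                # First report from this agent: create its entry.
                self._tasks[agent_id] = TaskEntry(agent_id, parent_id,
                                                  status, summary)
            else:
                # Subsequent reports update status, summary, and timestamp.
                entry.status = status
                entry.summary = summary
                entry.updated_at = time.monotonic()

    def snapshot(self) -> dict[str, TaskEntry]:
        with self._lock:
            return dict(self._tasks)

registry = TaskRegistry()
registry.update_task("root", "running", "implementing feature across five files")
registry.update_task("research-1", "running", "researching auth flow",
                     parent_id="root")
registry.update_task("research-1", "done", "auth flow summarized")
```

The lock stands in for the concurrent hash map: many agents can report at once, and readers always see a consistent snapshot.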
walrus ps reads the registry and renders a live tree: agent id, current summary, elapsed time, status.
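The tree rendering amounts to a walk over parent links. A sketch, with a hypothetical snapshot shape (agent id mapped to parent, status, summary, and start time); the actual walrus ps output format is not specified here:

```python
import time

# Hypothetical snapshot: agent_id -> (parent_id, status, summary, started_at).
now = time.monotonic()
tasks = {
    "root":    (None,   "running", "feature across five files", now - 600),
    "explore": ("root", "done",    "mapped call sites",         now - 580),
    "plan":    ("root", "running", "drafting change plan",      now - 400),
}

def render_tree(tasks, parent=None, depth=0):
    """Indent each agent under its parent, oldest first within a level."""
    lines = []
    for agent_id, (p, status, summary, started) in tasks.items():
        if p == parent:
            elapsed = int(time.monotonic() - started)
            lines.append(f"{'  ' * depth}{agent_id:<10} {status:<8} "
                         f"{elapsed:>4}s  {summary}")
            lines += render_tree(tasks, agent_id, depth + 1)
    return lines

print("\n".join(render_tree(tasks)))
```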
The registry is session-scoped and ephemeral. It lives in memory for the
duration of the session, which means reads are microsecond-fast — no
database round-trip for a live view. When the session ends, completed tasks
flush to LanceDB as Episode nodes: durable, queryable, part of the
agent's history.
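The end-of-session flush can be sketched as a filter over the registry: finished entries become episode records, everything else is discarded with the session. The field names below are assumptions, not walrus's actual Episode schema, and the LanceDB write is shown only as a comment:

```python
def flush_episodes(tasks: dict, session_id: str) -> list[dict]:
    """Convert finished registry entries into durable episode records.

    Field names are illustrative, not walrus's actual Episode schema.
    """
    episodes = []
    for agent_id, entry in tasks.items():
        if entry["status"] in ("done", "failed"):
            episodes.append({
                "session_id": session_id,
                "agent_id": agent_id,
                "parent_id": entry["parent_id"],
                "outcome": entry["status"],
                "summary": entry["summary"],
            })
    # In walrus these rows would land in LanceDB, conceptually:
    #   lancedb.connect(path).open_table("episodes").add(episodes)
    return episodes

tasks = {
    "root":    {"parent_id": None,   "status": "done",
                "summary": "feature shipped"},
    "explore": {"parent_id": "root", "status": "done",
                "summary": "mapped call sites"},
    "plan":    {"parent_id": "root", "status": "running",
                "summary": "abandoned mid-step"},
}
records = flush_episodes(tasks, "session-42")
```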
This gives you three time horizons:
- Now — walrus ps (live, in-memory, instant)
- This session — full task tree available while the session runs
- History — walrus memory show --episodes (graph, queryable across sessions)
What this enables
Live intervention. When a subagent is running in a loop or working on
the wrong thing, the user can identify it from walrus ps and cancel or
redirect. Without the registry, the only option is to kill the whole session.
Session replay. After a complex multi-agent run, the task tree (flushed to the graph) is a structured record of what happened: which agent did what, in what order, and what the outcome was. This is more useful than scrolling through a conversation log.
Debugging stuck agents. Staleness is detectable from the last-updated timestamp on each entry. walrus ps can surface a warning when an agent hasn't reported progress for an unusually long time.
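The staleness check itself is a one-pass scan over last-updated timestamps. A sketch; the 120-second threshold is an illustrative default, not a walrus setting:

```python
import time

STALE_AFTER = 120  # seconds without a report; illustrative threshold

def stale_agents(last_updated: dict[str, float], now: float) -> list[str]:
    """Agent ids that have not reported progress within the threshold."""
    return [agent_id for agent_id, ts in last_updated.items()
            if now - ts > STALE_AFTER]

now = time.monotonic()
last_updated = {"explore": now - 30, "plan": now - 300}
```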
Cost attribution. Per-agent context usage (a future registry field) makes it possible to see which subagent consumed most of the session budget.
Open questions
Three design questions aren't settled yet and will be covered in follow-up posts:
Distributed sessions. If walrus runs multiple processes (client + server, or multiple workers), the in-memory registry doesn't span processes. This is a future problem — local single-process sessions are the target for now — but the design should not make it hard to add a shared registry later.
Approval gates. Should the task registry be the enforcement point for user approval before destructive actions? When an agent reports it's about to delete or modify something irreversible, should the runtime pause and prompt? This connects to the permissions and sandboxing design, which is a separate post.
Retention. How long to keep completed episode nodes in the graph? Forever is the obvious answer until you have a lot of sessions. Retention policy, pruning, and export are part of the memory management design.
Further reading
- Plans vs tasks: how AI agents think before they act — the previous post that introduced this problem
- Graph + vector hybrid memory for AI agents — the LanceDB + lance-graph design that backs the Episode flush
- Less code, more skills — walrus's design principle and the skills/runtime boundary
- Claude Code issue #24537 — the agent hierarchy dashboard feature request
- OpenTelemetry AI Agent Observability — the emerging standard for agent tracing infrastructure