
Context compaction in agent frameworks

We surveyed eight frameworks to understand who compacts what in multi-agent systems. The answer: each agent handles its own, and nobody coordinates.

research · OpenWalrus Team

Every agent session eventually hits the wall. A 200K-token context window sounds infinite until your agent has read 30 files, called 50 tools, and produced 15 rounds of back-and-forth. At that point, you need a strategy: summarize the old stuff, drop it, or crash.

But the harder question emerges in multi-agent systems. When a leader agent delegates to three workers, each burning through context at different rates — who decides when to compact? Does each worker manage its own context, or does the leader compact on behalf of its team? Does anyone coordinate the timing?

We surveyed eight frameworks — Claude Code, OpenAI Agents SDK, LangChain/LangGraph, CrewAI, AutoGen, Cursor, Aider, and Google ADK — to find out. The answer: each agent compacts itself, independently, with no coordination. One framework (AutoGen) offers centralized compaction. Nobody else has tried.

[Chart: Context Compaction Capabilities by Framework (0–10 scale)]

The four compaction strategies

Summarization

An LLM reads the conversation history and produces a condensed summary. The summary replaces older messages, preserving semantic content while reducing token count. Six of eight frameworks use this as their primary strategy. The risk: summaries are lossy. Factory.ai's evaluation of 36,611 production messages found all compaction methods scored 2.19–2.45 out of 5.0 on artifact tracking — remembering which files were modified is uniformly weak.

Truncation

Drop older messages entirely. Keep the most recent N messages or N tokens. Simple, fast, zero LLM cost — but no continuity. If the answer to the current question depends on something said 50 messages ago, it's gone.
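A minimal sketch of the keep-most-recent strategy, in plain Python (whitespace splitting stands in for a real tokenizer):

```python
def truncate(messages, max_tokens):
    """Keep the most recent messages that fit within max_tokens.

    Token counts are approximated by whitespace splitting; a real
    implementation would use the model's tokenizer.
    """
    kept, budget = [], max_tokens
    for msg in reversed(messages):
        cost = len(msg.split())
        if cost > budget:
            break                      # older messages are dropped entirely
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))

history = ["msg one", "msg two", "msg three"]
print(truncate(history, 4))  # ['msg two', 'msg three']
```

Note that "msg one" is gone without a trace, which is exactly the continuity problem described above.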

Sliding window

Keep a fixed window of recent events, with deliberate overlap between windows. Similar to truncation but the overlap maintains continuity across boundaries. Google ADK's event-based compaction uses this approach.

Encrypted compaction

OpenAI's unique approach. The compacted state is an opaque, encrypted blob that the model can interpret but developers cannot inspect. It may preserve more information than text summaries, but it's completely unverifiable.

[Chart: Compaction Strategy Profile by Framework (0–10 scale)]

The multi-agent question

The central design question in multi-agent compaction isn't how to compact — it's who compacts whose context. Three patterns are possible:

Pattern 1: Context isolation with summary return. Each agent has its own context window. When a sub-agent finishes, only its final summary returns to the parent. The parent never sees the child's full context, so the question of "who compacts the child" is moot. This is the dominant pattern — used by Claude Code, OpenAI, LangGraph, CrewAI, Google ADK, and Manus.

Pattern 2: Delegated compaction. A group manager compresses the shared conversation and broadcasts the compressed version to all agents. AutoGen's CompressibleGroupManager is the only production implementation. The SupervisorAgent pattern from academic research implements a variant: "adaptive observation purification" where a supervisor refines long observations before they reach worker agents, achieving 29.68% average token reduction.

Pattern 3: Per-agent independent compaction. Each agent monitors its own token usage and compacts when it hits a threshold. There's no coordination — agents compact at different times, with different strategies, producing different summaries of overlapping context. This is what every framework defaults to for within-session management.
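Pattern 3 reduces to each agent owning its own monitor-and-compact loop. A sketch (the summarizer is a stub standing in for an LLM call; thresholds and token counts are illustrative):

```python
class Agent:
    """Each agent watches its own token count and compacts independently;
    summarization is a stub standing in for an LLM call."""

    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold
        self.history = []

    def tokens(self):
        return sum(len(m.split()) for m in self.history)

    def observe(self, message):
        self.history.append(message)
        if self.tokens() > self.threshold:   # no coordination with peers
            self.compact()

    def compact(self):
        summary = f"[summary of {len(self.history)} messages]"
        self.history = [summary]             # lossy: older detail is gone

workers = [Agent("w1", threshold=6), Agent("w2", threshold=12)]
for i in range(5):
    for w in workers:
        w.observe(f"tool result {i}")
# The two workers compact at different times, on different slices of the
# same stream, producing different summaries of overlapping context.
```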

Anthropic's context engineering guide articulates the design principle behind Pattern 1: "Share memory by communicating, don't communicate by sharing memory." Each agent owns its context. Communication happens through structured, already-compressed summaries. The parent never holds the child's raw context to compact in the first place.

Framework by framework

Claude Code — server-side summarization with custom prompts

Claude Code has the most mature compaction system of any coding agent. It operates at two levels: API-level compaction and client-side auto-compact.

API-level compaction. When input tokens exceed a configurable trigger threshold (default: 150,000 tokens, minimum: 50,000), Claude generates a summary and creates a compaction block. On subsequent requests, all message blocks prior to the compaction block are dropped. The default prompt asks for "state, next steps, learnings" — replaceable via a custom instructions parameter. A pause_after_compaction option returns the compaction block with a "compaction" stop reason, letting you inspect and inject context before continuation.

Claude Code auto-compact. The CLI triggers at approximately 95% context capacity. The /compact command triggers manually. Compaction is instant since v2.0.64. What survives: user requests, key code changes, architectural decisions. What often gets lost: detailed early instructions, intermediate tool outputs. Mitigation: put persistent rules in CLAUDE.md — it's re-injected every turn regardless of compaction.

Multi-agent. Each subagent gets its own context window. Compaction is invisible to the parent — it receives only the final result. Background agents use "delta summarization," producing 1–2 sentence incremental updates rather than reprocessing full context. Sub-agent transcripts persist in separate files, unaffected by main conversation compaction.

Setting      | Value
Trigger      | ~95% context (auto) or /compact (manual)
Strategy     | LLM summarization
Configurable | Trigger threshold, custom instructions, pause behavior
Multi-agent  | Per-agent independent, summary return

OpenAI Agents SDK — truncation and encrypted compaction

OpenAI provides two mechanisms: truncation and a dedicated compaction endpoint.

Truncation. Setting truncation: "auto" drops input items from the middle of the conversation to fit the context window. Pure truncation — no summarization, no semantic preservation. Disabled by default (fails with 400 on overflow).

Compaction endpoint. The /responses/compact endpoint performs loss-aware compression, returning an encrypted, opaque compaction item. All prior user messages stay verbatim. Prior assistant messages, tool calls, and reasoning are replaced with the encrypted blob. The model can interpret it; developers cannot. compact_threshold controls automatic triggering.

Codex was the first model natively trained to operate across multiple context windows through compaction: for Codex, compaction is part of training, not a post-hoc technique.

Multi-agent. The SDK uses a RunContextWrapper that shares state across agents in a single run. Context management is configured at the API level, not per-agent. Sub-agents return summaries to keep the orchestrator's context clean.

Setting      | Value
Trigger      | Token threshold (configurable) or manual
Strategy     | Truncation or encrypted compaction
Configurable | truncation, compact_threshold
Multi-agent  | Shared context wrapper, summary return

LangChain/LangGraph — composable primitives

LangGraph provides the most modular context management. Rather than a single strategy, it offers composable primitives that developers assemble.

Trimming. The trim_messages utility, applied via @before_model middleware, counts tokens and removes older messages while preserving the first message for context.
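The trimming behavior described here, drop older messages but always keep the first, can be approximated in plain Python (this mirrors the shape of trim_messages, not its implementation; whitespace counts stand in for a real token counter):

```python
def trim(messages, max_tokens):
    """Keep the first message plus the most recent messages that fit.
    Older middle messages are dropped first."""
    if not messages:
        return []
    count = lambda m: len(m.split())
    first, rest = messages[0], messages[1:]
    budget = max_tokens - count(first)   # the first message is always kept
    kept = []
    for msg in reversed(rest):
        if count(msg) > budget:
            break
        kept.append(msg)
        budget -= count(msg)
    return [first] + list(reversed(kept))

msgs = ["system: be terse", "a b c", "d e", "f g h i"]
print(trim(msgs, 10))  # ['system: be terse', 'd e', 'f g h i']
```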

Summarization. The SummarizationMiddleware replaces culled messages with LLM-generated summaries. Configuration includes trigger (e.g., ("tokens", 4000)), keep (e.g., ("messages", 20)), and model (which model summarizes).

Message deletion. RemoveMessage with the add_messages reducer enables surgical removal. A caveat: you must ensure the resulting history remains valid per provider requirements (alternating roles for OpenAI).

Custom middleware. @before_model and @after_model decorators let you build arbitrary transforms — hierarchical summarization, topic-aware compaction, or anything else.

LangGraph's context engineering philosophy groups strategies into four buckets: write, select, compress, and isolate. Large tool results (>20,000 tokens) are offloaded to the filesystem and replaced with path references — a form of compaction that avoids summarization entirely.
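The offloading trick, replacing an oversized tool result with a file reference, is simple to sketch. The 20,000-token cutoff comes from the text above; everything else (function name, file layout, whitespace token estimate) is illustrative:

```python
import os
import tempfile

OFFLOAD_THRESHOLD = 20_000   # tokens; above this, results leave the context

def maybe_offload(tool_result, workdir):
    """Write oversized tool results to disk and keep only a path reference
    in context -- compaction that avoids summarization entirely."""
    tokens = len(tool_result.split())          # crude token estimate
    if tokens <= OFFLOAD_THRESHOLD:
        return tool_result
    path = os.path.join(workdir, "tool_result.txt")
    with open(path, "w") as f:
        f.write(tool_result)
    return f"[result offloaded to {path}: {tokens} tokens]"

with tempfile.TemporaryDirectory() as d:
    big_result = "x " * 25_000
    print(maybe_offload(big_result, d)[:25])   # path reference, not the blob
```

The agent can always re-read the file with a tool call if the detail turns out to matter, which is the key advantage over a lossy summary.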

Multi-agent. Each agent manages its own state via checkpointer. Sub-agents are explicitly framed as "context quarantine" — the parent receives only the final result, not the tool calls that produced it.

Setting      | Value
Trigger      | Developer-configured per middleware
Strategy     | Trim, summarize, delete, or custom
Configurable | Highly — composable primitives
Multi-agent  | Per-agent via checkpointer, summary return

CrewAI — automatic summarization on overflow

CrewAI takes the most opinionated approach: a single boolean.

respect_context_window. When True (default), CrewAI monitors each agent's conversation against the LLM's context limit. On overflow, summarize_messages() splits the conversation into chunks, summarizes each via LLM, and replaces originals with a single summary. When False, overflow crashes.

No trigger threshold. No custom prompt. No summarization model choice. On or off.
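The chunk-and-summarize shape described above can be sketched in a few lines (a sketch of the behavior, not CrewAI's implementation; the summarizer is a stub where CrewAI calls the LLM):

```python
def summarize_chunk(chunk):
    # Stub: a real implementation sends the chunk to an LLM.
    return f"[summary of {len(chunk)} messages]"

def compact_on_overflow(messages, limit, chunk_size=2):
    """If the history exceeds the limit, split it into chunks, summarize
    each, and replace the originals with one combined summary."""
    if len(messages) <= limit:
        return messages              # under the limit: nothing to do
    chunks = [messages[i:i + chunk_size]
              for i in range(0, len(messages), chunk_size)]
    return [" ".join(summarize_chunk(c) for c in chunks)]

print(compact_on_overflow(["m1", "m2", "m3", "m4", "m5"], limit=4))
```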

CrewAI's memory system operates separately. After each task, discrete facts are extracted and stored in ChromaDB. Before each task, relevant context is recalled and injected. This supplements but doesn't replace within-session compaction.

Multi-agent. Each agent detects and handles its own overflow. The crew orchestrator doesn't coordinate.

Setting      | Value
Trigger      | Automatic on overflow
Strategy     | LLM summarization (chunked)
Configurable | respect_context_window: true/false only
Multi-agent  | Per-agent independent

AutoGen — the exception: centralized compaction

AutoGen is the only framework that implements genuine delegated compaction. The rest handle their own context; AutoGen's CompressibleGroupManager manages it for the group.

AutoGen 0.2 transforms. Three built-in transforms compose via TransformMessages:

  • MessageHistoryLimiter(max_messages=N) — keep the last N messages
  • MessageTokenLimiter(max_tokens=N) — truncate context to token budgets
  • LLMLingua compression — semantic text compression preserving meaning

AutoGen 0.4 context types. The ChatCompletionContext hierarchy:

  • BufferedChatCompletionContext(buffer_size=N) — most-recent-used
  • HeadAndTailChatCompletionContext(head_size=N, tail_size=M) — first N + last M, preserving initial instructions
  • TokenLimitedChatCompletionContext — token budget (experimental)
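The head-and-tail selection is worth spelling out, since it is the only built-in that protects initial instructions. A plain-Python sketch of that shape (not AutoGen's code):

```python
def head_and_tail(messages, head_size, tail_size):
    """Keep the first head_size and last tail_size messages, dropping
    the middle -- initial instructions survive, recent context is bounded."""
    if len(messages) <= head_size + tail_size:
        return messages
    return messages[:head_size] + messages[-tail_size:]

msgs = [f"m{i}" for i in range(10)]
print(head_and_tail(msgs, head_size=2, tail_size=3))
# ['m0', 'm1', 'm7', 'm8', 'm9']
```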

The group manager pattern. In GroupChat, all agents subscribe to a shared message topic. The CompressibleGroupManager compresses this shared stream and broadcasts the compressed version to all participants. This is fundamentally different from per-agent compaction — a single authority decides when and how, then distributes one compressed view.

Multi-agent. Two modes: per-agent (transforms attach individually) or centralized (group manager compresses for all). The centralized mode is possible because AutoGen's group chat inherently shares a single message stream — making coordinated compression both possible and necessary.

Note: Microsoft unified AutoGen and Semantic Kernel into the Microsoft Agent Framework in late 2025.

Setting      | Value
Trigger      | Configurable (message count, token count, custom)
Strategy     | Truncation, LLMLingua compression, or custom
Configurable | Highly — pluggable transforms and context types
Multi-agent  | Per-agent or centralized (CompressibleGroupManager)

Cursor — flash-model summarization

Cursor handles compaction at two levels: automatic chat summarization and file condensation.

Auto-summarization. When approaching the context limit, Cursor summarizes using a smaller "flash" model — not your current model. Fast and cheap, but quality depends on the flash model's ability to capture nuance.

Manual triggers. /summarize or /compress (added in v1.6) trigger on demand.

File condensation. Separately, large files may be condensed to signatures and structure rather than included verbatim.

Limitations. No threshold tuning, no custom prompt, no option to disable auto-summarization. Users report context loss in long debugging sessions. Cursor's recommendation: start new chats for separate tasks.

Multi-agent. N/A — single agent per chat session.

Setting      | Value
Trigger      | Automatic near context limit
Strategy     | LLM summarization (flash model)
Configurable | Manual trigger only
Multi-agent  | N/A

Aider — recursive summarization in a background thread

Aider's ChatSummary class takes a distinctive approach: recursive summarization using a cheap model, running in a background thread.

How it works. When chat history exceeds max_chat_history_tokens (model-dependent defaults), aider recursively breaks history into chunks and summarizes each. Recursion continues until the summary fits. A configurable "weak model" handles summarization — e.g., GPT-4o-mini while coding with Claude Opus. The --weak-model flag overrides this.
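The recursive shape can be sketched without aider's internals: chunk, summarize each chunk, rejoin, and repeat until the result fits. The summarizer here is a stub (aider would call the weak model); chunk sizes and counts are illustrative:

```python
def count_tokens(text):
    return len(text.split())              # stand-in for a real tokenizer

def summarize(text):
    # Stub: aider sends each chunk to a cheap "weak model". Keeping the
    # first few words makes the recursion observable without an LLM.
    return " ".join(text.split()[:4]) + " ..."

def recursive_summarize(text, max_tokens, chunk_words=20):
    """Chunk, summarize each chunk, rejoin; repeat until the result fits."""
    while count_tokens(text) > max_tokens:
        words = text.split()
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words)]
        text = " ".join(summarize(c) for c in chunks)
    return text

long_history = " ".join(f"w{i}" for i in range(100))
short = recursive_summarize(long_history, max_tokens=10)
print(short)
```

With 100 tokens of input, the first pass produces five chunk summaries that together still exceed the budget, so a second pass summarizes the summaries, which is the "recursion continues until the summary fits" behavior.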

Context partitioning. Rather than one context window, aider splits into regions: system prompt, repo map (tree-sitter-based codebase summary), chat history (subject to summarization), and active files (full content). Users /drop inactive files to manage context manually.

Multi-agent. Aider's architect/editor pair shares chat history. Compaction is managed centrally — not per-role. This is a mild form of shared compaction, though with only two agents it's closer to a single pipeline than true coordination.

Setting      | Value
Trigger      | Soft token limit (max_chat_history_tokens)
Strategy     | Recursive LLM summarization (background thread)
Configurable | --max-chat-history-tokens, --weak-model, --map-tokens
Multi-agent  | Shared history, central compaction

Google ADK — event-based sliding window

Google ADK takes the most structured approach: event-based compaction with a sliding window and configurable overlap.

How it works. ADK tracks workflow events within a session. When completed events reach compaction_interval, an asynchronous LLM summarization of older events runs over a sliding window. The summary is written back as a new event.

Overlap. The overlap_size parameter controls how many previously-compacted events re-enter the next window:

  • Event 3: events 1–3 summarized
  • Event 6: events 3–6 summarized (event 3 overlaps)
  • Event 9: events 6–9 summarized (event 6 overlaps)

This continuity across boundaries is a key advantage over truncation.
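The schedule above can be reproduced with a few lines of windowing arithmetic (a sketch of the schedule, not ADK's code):

```python
def compaction_windows(num_events, interval, overlap):
    """Yield (start, end) event windows, inclusive and 1-indexed.
    Compaction fires every `interval` completed events, and each window
    re-includes `overlap` events from the previous window's tail."""
    windows, start = [], 1
    for end in range(interval, num_events + 1, interval):
        windows.append((start, end))
        start = end - overlap + 1      # the overlap carries continuity
    return windows

print(compaction_windows(9, interval=3, overlap=1))
# [(1, 3), (3, 6), (6, 9)]
```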

Custom summarizer. Supply a different model (e.g., Gemini(model="gemini-2.5-flash") for cheaper summarization) or customize the prompt_template. Google reports 60–80% token reduction — their own estimate, unverified externally.

Multi-agent. EventsCompactionConfig applies at the App level. The Runner handles compaction for all agents under the same app — consistent settings, but not coordinated timing. During agent transfer, ADK performs "narrative casting" — re-labeling prior assistant messages so the new agent doesn't hallucinate it performed those actions. The include_contents parameter controls how much parent context flows to sub-agents: full (default) or none (sub-agent sees only the latest turn).

Setting      | Value
Trigger      | Event count (compaction_interval)
Strategy     | Sliding window + LLM summarization with overlap
Configurable | compaction_interval, overlap_size, custom summarizer
Multi-agent  | App-level config, Runner manages for all agents

The full landscape

Framework   | Strategy                   | Trigger            | Multi-Agent Compaction     | Configurable
Claude Code | Summarization              | ~95% capacity      | Per-agent independent      | High
OpenAI SDK  | Encrypted / Truncation     | Threshold / auto   | Per-agent, shared wrapper  | Medium
LangGraph   | Composable primitives      | Developer-set      | Per-agent via checkpointer | Very high
CrewAI      | Summarization              | Overflow detection | Per-agent independent      | Low
AutoGen     | Transforms / context types | Developer-set      | Centralized option         | Very high
Cursor      | Flash-model summarization  | Near limit         | N/A (single-agent)         | Low
Aider       | Recursive summarization    | Soft token limit   | Shared history             | Medium
Google ADK  | Sliding window + overlap   | Event count        | App-level config           | High

What the research says

Token budget math. The Phase Transition for Budgeted Multi-Agent Synergy paper (January 2026) formalizes the problem: star topologies saturate at approximately N ≈ W/m agents (context window W divided by message length m). Hierarchical trees bypass this — each aggregation node enforces b·m ≤ W locally, enabling N = b^L total agents across L levels. This means compaction at each level isn't optional — it's a mathematical requirement for scaling.
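Plugging in numbers makes the gap concrete (the values below are illustrative, not from the paper):

```python
# Star topology: the hub must hold every worker's messages, so the agent
# count saturates at roughly N ~ W / m.
W = 200_000        # context window in tokens (illustrative)
m = 5_000          # tokens per agent message (illustrative)
star_cap = W // m
print(star_cap)    # 40 agents before the hub's window saturates

# Hierarchical tree: each node aggregates only b children (b*m <= W),
# so total agents scale as N = b^L across L levels.
b = star_cap       # branching factor under the same per-node budget
L = 3
print(b ** L)      # 64000 agents with the same per-node window
```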

Budget-aware tools. BATS (November 2025) introduces a Budget Tracker module that provides real-time resource status after each tool call. Four spending regimes: HIGH (≥70% remaining), MEDIUM (30–70%), LOW (10–30%), CRITICAL (under 10%). When iteration count exceeds a threshold, historical trajectories are replaced with concise summaries. Result: comparable accuracy with 10x less budget.
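The four regimes reduce to a threshold function over the remaining-budget fraction (a sketch of the banding described above, not BATS code):

```python
def spending_regime(remaining_fraction):
    """Map remaining budget (0.0 to 1.0) to a BATS-style spending regime."""
    if remaining_fraction >= 0.70:
        return "HIGH"        # spend freely
    if remaining_fraction >= 0.30:
        return "MEDIUM"
    if remaining_fraction >= 0.10:
        return "LOW"
    return "CRITICAL"        # under 10%: summarize aggressively

print(spending_regime(0.8), spending_regime(0.5),
      spending_regime(0.2), spending_regime(0.05))
# HIGH MEDIUM LOW CRITICAL
```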

Compression quality. Factory.ai's evaluation compared three compaction methods across 36,611 production messages. Quality scores (0–5): Factory 3.70, Anthropic 3.44, OpenAI 3.35. Compression ratios: all above 98%. The critical finding: artifact tracking is uniformly weak — all methods scored 2.19–2.45/5.0 on remembering file modifications. No method reliably remembers what files were changed.

Context-aware compression. ACON (October 2025) uses failure-driven guideline optimization: given paired trajectories where full context succeeds but compressed context fails, an LLM analyzes the cause and updates compression guidelines. Result: 26–54% memory reduction while preserving 95%+ accuracy.

Why nobody coordinates

The absence of coordinated multi-agent compaction isn't an oversight — it's a consequence of how frameworks isolate context. In the dominant "context isolation with summary return" pattern, the parent never holds the child's full context. There's nothing to coordinate because there's nothing shared.

AutoGen's CompressibleGroupManager exists because AutoGen's group chat model inherently shares a single message stream across all participants. When everyone reads the same stream, centralized compression both makes sense and becomes necessary. No other framework shares context this way, so no other framework needs centralized compaction.

The Manus team found a related insight: early versions wasted approximately 30% of tokens on their planner constantly rewriting todo.md. Their fix wasn't better compaction — it was splitting the planner and executor into separate agents with separate context, so each could compact independently. The compaction problem became an architecture problem.

What this means for OpenWalrus

For a local-first runtime where every token is computed on your own hardware, context compaction is a resource management problem. Wasted context tokens mean wasted GPU cycles.

Several findings inform our approach:

  • Sliding window with overlap (ADK's approach) provides predictable compaction timing and explicit continuity across windows. For a runtime that values reliability, structure matters more than flexibility.
  • Separate summarization models make sense with local inference. Running a small model for summarization while the main model thinks is nearly free on the same machine. Aider and Cursor both validate this pattern.
  • Persistent instructions shouldn't depend on compaction. If something must persist, it belongs in a file re-injected every turn — not in conversation history. We explored this in our persistent agent memory research.
  • Compaction should be observable. OpenAI's encrypted items are a black box. For a system where task state is a runtime primitive, compaction events should be visible in the task tree.
  • Context isolation is the real compaction strategy. The most effective frameworks don't compact better — they avoid needing to compact by isolating context per agent and communicating through compressed summaries. Multi-agent coordination, explored in our coordination research, is an architecture question first and a compaction question second.

Open questions

Should compaction be lossy? Every summarization approach loses information. OpenAI's encrypted compaction claims to preserve "latent understanding" but is unverifiable. Could a structured schema — "always preserve: file paths, function names, error messages, decisions; drop: intermediate reasoning, tool output details" — produce more reliable results?

When should compaction trigger? Claude Code triggers at 95% capacity. Google ADK triggers on event count. LangGraph lets you set a token threshold. The right trigger depends on the workload: event-based is predictable for tool-heavy agents, token-based is safer for conversation-heavy ones. Should the trigger be adaptive?

Can compaction be reversible? Aider keeps full chat history on disk while summarizing the in-context version. If a summarized-away detail turns out critical, could the agent re-expand from full history? No framework supports this, but it's architecturally possible with persistent storage.
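A re-expandable design is easy to sketch: keep the full transcript on disk, put only the summary in context, and tag the summary with where the detail lives. This is a hypothetical design, not any framework's feature; all names below are invented:

```python
import json
import os
import tempfile

class ReversibleCompactor:
    """Summaries in context, full history archived for re-expansion."""

    def __init__(self, archive_dir):
        self.archive_dir = archive_dir
        self.counter = 0

    def compact(self, messages):
        """Archive messages; return a summary tagged with its archive id."""
        self.counter += 1
        path = os.path.join(self.archive_dir, f"segment-{self.counter}.json")
        with open(path, "w") as f:
            json.dump(messages, f)
        return f"[summary #{self.counter}: {len(messages)} messages archived]"

    def expand(self, segment_id):
        """Recover the original messages behind a summary."""
        path = os.path.join(self.archive_dir, f"segment-{segment_id}.json")
        with open(path) as f:
            return json.load(f)

with tempfile.TemporaryDirectory() as d:
    c = ReversibleCompactor(d)
    summary = c.compact(["fix bug in parser.py", "tests pass"])
    print(summary)
    print(c.expand(1))   # the detail is recoverable
```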

Is centralized compaction worth the coupling? AutoGen's CompressibleGroupManager is unique. It solves the coordination problem but creates a single point of failure for context quality. If the group manager's summary is poor, every agent suffers. Is per-agent independent compaction actually the safer default?

How should compaction interact with graph-based memory? If completed tasks flush to a knowledge graph as Episode nodes, does the context window even need to hold old conversations? The graph provides retrieval. The context provides recency. Compaction becomes a bridge — and the question shifts from "what should the summary contain" to "what should be committed to the graph before compacting."
