How AI frameworks control model thinking
We surveyed seven agent frameworks to understand who controls reasoning depth — the framework, the API, or nobody. The answers split three ways.
Every reasoning model can think harder if you ask it to. Claude has budget_tokens and effort. OpenAI has reasoning_effort. Google has thinking levels. The API surface exists. The question is: who decides when to use it?
We surveyed seven agent frameworks — Claude Code, Cursor, OpenClaw, GitHub Copilot, Windsurf, Aider, and Devin — to understand how they handle model thinking. The approaches split into three camps: frameworks that actively control reasoning depth via API parameters, frameworks that shape thinking through prompts and architecture, and frameworks that don't try to control it at all.
The three approaches
API-parameter-controlled
The framework translates user intent or task signals into provider-specific API parameters — thinking.budget_tokens, reasoning_effort, effort — before the request reaches the model. The model receives explicit instructions about how hard to think.
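As a sketch, this translation layer can be a single function. The budget numbers, provider names, and fallback behavior below are illustrative, not any particular framework's actual values:

```python
# Hypothetical sketch: translate a framework-level "effort" setting into
# provider-specific reasoning parameters. Budgets are illustrative.
def reasoning_params(provider: str, effort: str) -> dict:
    if provider == "anthropic":
        # Anthropic exposes an explicit thinking token budget.
        budgets = {"low": 4_000, "medium": 10_000, "high": 32_000}
        return {"thinking": {"type": "enabled", "budget_tokens": budgets[effort]}}
    if provider == "openai":
        # OpenAI reasoning models take a categorical effort level.
        return {"reasoning_effort": effort}
    # Providers with only a binary toggle: think unless effort is "low".
    return {"thinking": effort != "low"}
```

The point is that the same user-facing setting fans out into incompatible provider dialects, and the framework owns that mapping.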
Prompt and architecture controlled
The framework doesn't touch reasoning API parameters. Instead, it shapes thinking through prompt design ("think step by step"), model selection (use a reasoning model for planning, a fast model for editing), or model routing (analyze the request and pick the right tier). The control is indirect.
Let it go
The framework sends the prompt to whichever model the user selected and lets the model decide how to reason. No API parameter tuning, no prompt engineering for reasoning depth, no dynamic routing. The user is the router.
[Chart: Thinking Control Capabilities by Framework (0–10 scale)]

Framework-by-framework
Claude Code — from keyword hacks to adaptive thinking
Claude Code has gone through three distinct eras of thinking control.
Era 1: keyword interception. Claude Code detected keywords in user prompts and mapped them to budget_tokens values. "think" mapped to 4,000 tokens. "megathink" mapped to 10,000. "ultrathink" mapped to 32,000 (the maximum). The model never saw these keywords — they were intercepted by Claude Code's preprocessing layer. It was a hack, and Anthropic deprecated it in January 2026.
Era 2: always-on extended thinking. After the deprecation, extended thinking was enabled by default with the maximum budget on every request. This worked but was wasteful — simple questions like "what does this function do?" were allocated the full 32K-token thinking budget.
Era 3: adaptive thinking (current). The current system uses two API parameters together:
- thinking.type: "adaptive" — Claude dynamically decides whether and how much to think based on request complexity
- output_config.effort — a soft guidance signal with levels low, medium, high (default), and max (Opus only)
The /effort command in the CLI lets users switch between low, medium, and high. At high effort, Claude almost always thinks. At lower levels, it may skip thinking entirely for simple problems. Crucially, effort is "a behavioral signal, not a strict token budget" — the model can still think more or less than the level suggests.
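A request body using the two parameters might look like the sketch below. The payload shape is an assumption based on the parameter names described above — consult Anthropic's API documentation for the authoritative schema:

```python
# Request-body sketch for adaptive thinking plus an effort signal.
# Field placement is assumed from the parameter names, not verified
# against the live API schema.
def build_request(prompt: str, effort: str = "high") -> dict:
    return {
        "model": "claude-opus-4",              # illustrative model name
        "max_tokens": 4096,
        "thinking": {"type": "adaptive"},      # let Claude decide whether to think
        "output_config": {"effort": effort},   # soft signal: low/medium/high/max
        "messages": [{"role": "user", "content": prompt}],
    }
```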
Classification: API-parameter-controlled. Claude Code actively manages reasoning depth via API parameters, with the model making the final adaptive decision.
Cursor — the user is the router
Cursor does not control thinking depth. At all.
Users pick which model to use from a dropdown — GPT-4o, Claude Sonnet, Claude Opus, o3, Gemini. If you want deeper thinking, you select a thinking model. If you want speed, you select a fast model. Cursor's "Auto mode" picks a reliable model from the available pool, but the official documentation states it "does not route based on task type."
No reasoning_effort parameter. No budget_tokens. No dynamic model routing based on task complexity. The user decides.
Classification: let it go. Cursor outsources the thinking control decision entirely to the user.
OpenClaw — seven thinking levels and a router
OpenClaw has the most sophisticated thinking control of any framework examined.
Seven levels: off, minimal, low, medium, high, xhigh, adaptive. Each has natural-language aliases — "think" maps to minimal, "ultrathink" to high. The framework translates these into provider-specific parameters: Anthropic's budget_tokens, OpenAI's reasoning_effort, or binary on/off for providers like Z.AI and Moonshot that only support a toggle.
Resolution hierarchy (highest priority first):
- Inline directive in the current message (/t <level>, /think:<level>)
- Session override
- Per-agent configuration
- Per-model default
- Global default
- Fallback: adaptive for Claude, low for other reasoning models, off otherwise
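The hierarchy is a first-match-wins walk from most to least specific. A minimal sketch, with argument names invented for illustration rather than taken from OpenClaw's internals:

```python
# Resolve a thinking level by walking the priority chain described above:
# inline directive > session > agent > model default > global default,
# then the provider-dependent fallback.
def resolve_level(inline=None, session=None, agent=None,
                  model_default=None, global_default=None,
                  is_claude=False, is_reasoning_model=False):
    for level in (inline, session, agent, model_default, global_default):
        if level is not None:
            return level
    if is_claude:
        return "adaptive"
    if is_reasoning_model:
        return "low"
    return "off"
```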
Dynamic model routing: Separately from thinking levels, OpenClaw supports ClawRouter — a 15-dimension weighted scorer that analyzes token count, code presence, reasoning markers, technical terms, and multi-step patterns to route requests to LIGHT (Haiku), MEDIUM (Sonnet), or HEAVY (Opus) tiers. It runs locally with under 1ms latency. A key design choice: it scores only user messages, not the system prompt, to avoid the large system prompt inflating every request to the most expensive tier.
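A toy scorer in the spirit of ClawRouter can show why this runs in under a millisecond: it is just regex checks and a weighted sum. The four features, weights, and thresholds below are illustrative stand-ins for the real 15 dimensions:

```python
import re

# Toy weighted request scorer. Note it scores only the user message,
# never the system prompt, mirroring the design choice described above.
def route(user_message: str) -> str:
    features = {
        "long": len(user_message.split()) > 100,
        "code": bool(re.search(r"\bdef\b|\bclass\b|[{};]", user_message)),
        "reasoning": bool(re.search(r"\b(why|prove|analyze|tradeoff)\b",
                                    user_message, re.I)),
        "multistep": bool(re.search(r"\b(first|then|finally|step)\b",
                                    user_message, re.I)),
    }
    weights = {"long": 2, "code": 2, "reasoning": 3, "multistep": 2}
    score = sum(weights[name] for name, hit in features.items() if hit)
    if score >= 5:
        return "HEAVY"   # e.g. Opus
    if score >= 2:
        return "MEDIUM"  # e.g. Sonnet
    return "LIGHT"       # e.g. Haiku
```

Because everything is local string matching, there is no extra model call in the hot path — the tradeoff, as noted later, is that the router never sees the codebase context.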
Classification: API-parameter-controlled + dynamic routing. OpenClaw both controls reasoning depth per-request and routes requests to models of different capability.
GitHub Copilot — reasoning controls are coming
Copilot's approach is still evolving. Users can switch models mid-session with /model, choosing from Claude Opus, Sonnet, GPT-5.3-Codex, GPT-5 mini, GPT-4.1, and Gemini 3 Pro. The Copilot CLI changelog mentions "configure reasoning effort for extended thinking models," but this appears backend-only — not yet exposed as a user-facing control.
Classification: let it go (transitioning to API-parameter-controlled). Today, the user picks the model. Tomorrow, Copilot may expose reasoning effort controls.
Windsurf — model variants as reasoning levels
Windsurf takes a distinctive approach. Instead of exposing API parameters, it pre-configures model variants in the model selector: "GPT-5.4 (Low Reasoning)", "GPT-5.4 (Medium Reasoning)", "GPT-5.4 (High Reasoning)", "GPT-5.4 (Extra High Reasoning)". Similarly, "Claude Opus 4.6 (Thinking)" appears as a separate entry from standard Claude Opus.
Windsurf's custom SWE-1 model family takes this further with "variable thinking" — the model dynamically adjusts reasoning depth based on task complexity. Quick responses for simple tasks, deeper analysis for complex ones. This is native to the model, not framework-level control.
Different variants consume different amounts of prompt credits, making the cost-quality tradeoff visible to users.
Classification: API-parameter-controlled (via variant selection). Windsurf bakes reasoning levels into selectable model configurations rather than exposing raw API parameters.
Aider — the most explicit controls
Aider gives users direct access to reasoning parameters:
- --reasoning-effort low|medium|high for OpenAI's reasoning_effort
- --thinking-tokens 1k|8k|32k for Anthropic's budget_tokens
- In-chat commands: /thinking-tokens 4k, /reasoning-effort low
Aider uses model metadata (accepts_settings) to determine which parameters each model supports and warns you if you try to set an unsupported parameter.
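Metadata-gated settings can be sketched as a lookup plus a warning. The table and function below are illustrative, not Aider's actual accepts_settings implementation:

```python
import warnings

# Illustrative model-metadata table: which reasoning settings
# each model accepts.
MODEL_SETTINGS = {
    "o3": {"reasoning_effort"},
    "claude-sonnet": {"thinking_tokens"},
    "gpt-4o": set(),  # no reasoning controls
}

def apply_setting(model: str, name: str, value) -> dict:
    """Return the request kwargs for a setting, or {} with a warning
    if the model does not accept it."""
    if name not in MODEL_SETTINGS.get(model, set()):
        warnings.warn(f"{model} does not accept {name}; ignoring")
        return {}
    return {name: value}
```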
But Aider's most significant contribution to thinking control is architectural. The Architect/Editor pattern separates reasoning from editing:
- An Architect model (often a reasoning model like o1 or R1) describes how to solve the problem
- An Editor model (often GPT-4o or Sonnet) translates that plan into precise file edits
This produced state-of-the-art results: DeepSeek R1 as architect + Sonnet as editor achieved 64.0% on the aider polyglot benchmark at 14x less cost than the previous o1 SOTA. The insight: instead of making one model think harder, use two models — one for reasoning, one for execution. This mirrors the plan-vs-task separation we explored in agent design — plans are for reasoning, tasks are for execution.
Classification: API-parameter-controlled + architectural. Aider both exposes raw API parameters and introduces a model-pair architecture that implicitly controls where reasoning happens.
Devin — the black box
Devin doesn't expose thinking controls because Devin isn't a wrapper around a single model. It's a compound AI system — "a diverse set of model inferences to plan, act, evaluate, and use tools." Users set a spend limit per ticket (e.g., $5.00), and Devin allocates reasoning resources internally.
Cognition's blog describes Devin building explicit mechanisms to model user intent "across hundreds of millions of agent decisions." The internal architecture is proprietary. From the outside, it's a black box that thinks as hard as it thinks.
Classification: let it go (proprietary compound system). Devin controls thinking internally but gives users no knobs to turn.
The landscape at a glance
| Framework | Approach | Thinking levels | Dynamic routing | User control |
|---|---|---|---|---|
| Claude Code | API parameters | 4 (low/med/high/max) | No | /effort command |
| Cursor | Let it go | None | No (heuristic Auto) | Model dropdown |
| OpenClaw | API params + routing | 7 levels + aliases | Yes (ClawRouter) | Inline /t, per-agent config |
| Copilot | Let it go (transitioning) | Emerging | No | Model selector |
| Windsurf | Pre-configured variants | 4 per model | Partial (SWE-1) | Variant selector |
| Aider | API params + architecture | Direct param access | Architect/Editor | CLI flags + in-chat commands |
| Devin | Black box | Internal | Internal | Spend limit only |
What the research says
The academic consensus is clear on one thing: more thinking tokens is not always better.
Don't Overthink It (Hassid et al., 2025) found that shorter reasoning chains are up to 34.5% more accurate than the longest chain sampled for the same question. Their short-m@k method achieves similar or superior accuracy while using 40% fewer thinking tokens.
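The short-m@k selection rule itself is simple: sample k chains, keep the m shortest, and majority-vote their answers, breaking ties toward the shorter chain. A sketch, assuming chains arrive as (text, answer) pairs from some sampler you supply:

```python
from collections import Counter

def short_m_at_k(chains: list[tuple[str, str]], m: int) -> str:
    """Majority vote over the m shortest of k sampled reasoning chains."""
    shortest = sorted(chains, key=lambda c: len(c[0]))[:m]
    votes = Counter(answer for _, answer in shortest)
    best_count = votes.most_common(1)[0][1]
    tied = {ans for ans, n in votes.items() if n == best_count}
    # Tie-break toward the answer from the shortest chain.
    for _, answer in shortest:  # already sorted by length
        if answer in tied:
            return answer
```

In practice the k chains are generated in parallel and sampling stops once m finish, so the method also cuts wall-clock time, not just tokens.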
Increasing the Thinking Budget is Not All You Need demonstrated that alternative strategies — self-consistency, self-reflection — outperform simply raising the thinking budget. More tokens doesn't mean better reasoning.
Think Deep, Not Just Long introduced the "deep-thinking ratio" metric, showing that raw token counts are unreliable proxies for reasoning quality. Increased generation length doesn't consistently correlate with accuracy and may signal "overthinking" that degrades performance.
[Chart: Thinking Budget Research Findings (% change from baseline)]
The most promising direction is self-budgeting: having the model estimate its own needed compute. TALE (ACL 2025) reduces output token costs by 67% while maintaining competitive accuracy by letting the model allocate its own reasoning budget.
Nous Research found that open-weight reasoning models use 1.5-4x more tokens than closed models on identical tasks — up to 10x for simple knowledge questions. The per-token cost advantage of open models is often negated by their token inefficiency. For local-first runtimes, this efficiency gap matters even more — every wasted thinking token is wasted compute on your own hardware. (We explored this cost dynamic in why we built OpenWalrus.)
A comprehensive survey of adaptive test-time compute (Alomrani et al., 2025) frames the problem cleanly: current models are inefficient because they "often overthink simple problems while underthinking hard ones." The field is moving from fixed budgets to adaptive allocation.
Open questions
The landscape reveals several tensions without clear resolutions:
Should the framework or the model decide? Anthropic's adaptive thinking lets Claude decide when to think. OpenClaw's ClawRouter decides which model to use. Aider's architect/editor pattern decides where reasoning happens. Each delegates the decision to a different layer. Which one is closest to the signal — the framework that sees the user's full history, the model that understands the problem, or a lightweight router that can classify in under 1ms?
Is the Architect/Editor pattern the real answer? Aider's approach sidesteps the "how hard should this model think" question entirely. Instead of making one model think harder, it uses a reasoning model for planning and a fast model for editing — achieving SOTA results at 14x less cost. Does this generalize beyond coding, or is it specific to tasks with a clean plan-then-execute structure?
Do users actually want thinking controls? Cursor's "let it go" approach is the simplest and arguably the most popular. Most developers just want the right answer — they don't want to tune reasoning_effort or pick thinking levels. Is explicit control a power-user feature that becomes noise for everyone else? Or does the 34.5% accuracy gap between short and long chains mean that leaving it to chance is leaving quality on the table?
Can dynamic routing work at the framework level? OpenClaw's ClawRouter scores requests on 15 dimensions with under 1ms latency. But it only sees the user's message, not the full context. A request that looks simple ("fix this") may require deep reasoning once the agent reads the codebase. Is pre-request routing fundamentally limited, or can it be made context-aware without adding latency?
What happens when thinking costs approach zero? If inference costs drop 10x in the next two years (as they have in the past two), does the entire thinking-control problem dissolve? Or does the overthinking problem — where more thinking actively degrades accuracy — mean that budget control stays important regardless of cost?
Is "adaptive" just a better word for "uncontrolled"? Anthropic's adaptive thinking and Windsurf's variable thinking both let the model decide how much to reason. This works when the model's judgment about task complexity is good. When it's wrong — overthinking a simple question or underthinking a subtle bug — there's no user-visible feedback loop. Are we trading explicit control for implicit trust?
The frameworks that control thinking today are betting that reasoning is a resource worth managing. The frameworks that don't are betting that models will learn to manage it themselves. The research suggests both bets have merit — and neither has won yet.
Further reading
- Anthropic: Adaptive Thinking — how Claude decides when and how much to think
- Anthropic: Effort Parameter — soft guidance for reasoning depth
- Aider: Architect/Editor Pattern — separating reasoning from editing
- Don't Overthink It — shorter chains are up to 34.5% more accurate
- Reasoning on a Budget — comprehensive survey of adaptive test-time compute
- TALE: Token-Budget-Aware Reasoning — 67% token cost reduction via self-budgeting
- Nous Research: Thinking Efficiency — open vs closed model token efficiency
- ClawRouter — OpenClaw's 15-dimension request scorer