How AI agents remember: a survey of persistent memory
We surveyed how Claude Code, OpenClaw, ChatGPT, Cursor, and Windsurf implement persistent agent memory — storage formats, compaction, and retrieval.
AI agents are stateless by default. Every session starts from zero — the context window fills up, the conversation ends, and everything is gone. But useful agents need to learn. They need to remember your preferences, your project structure, the mistakes they made yesterday.
We surveyed five products — Claude Code, OpenClaw, ChatGPT, Cursor, and Windsurf — to understand how persistent memory actually works in production. Here's what we learned.
A taxonomy of agent memory
Not all memory serves the same purpose. We identified six functional roles that keep appearing across products, even when they use different names for them.
| Role | What it holds | Persistence | Example |
|---|---|---|---|
| Working memory | Current session context | Ephemeral | Chat history in context window |
| Agent profile | Agent-specific persistent knowledge | Durable, per-agent | CLAUDE.md, .cursorrules |
| User profile | User preferences, habits, personal info | Durable, cross-agent | ChatGPT's "memory" feature |
| Episodic memory | Chronological interaction logs | Timestamped | JSONL session journals |
| Semantic memory | Searchable knowledge base | Indexed | RAG-backed vector store |
| Date-anchored memory | Time-stamped facts that expire | Temporal | "User is on vacation until March 15" |
Working memory is what most people think of — the chat history sitting in the context window. It's fast but volatile. When the window fills up, something has to go.
Agent profile is the agent's persistent identity. Claude Code uses CLAUDE.md files, Cursor uses .cursorrules. These are always loaded at session start — they tell the agent how to behave.
User profile is different from agent profile, though products often conflate them. Agent profiles are scoped to one agent instance. User profiles span agents — your timezone, your communication style, your name. ChatGPT's memory feature is user-scoped. Claude Code's CLAUDE.md is agent-scoped.
Episodic memory is the journal. Timestamped session logs — who said what, when, in what order. Usually stored as JSONL or in a database with temporal indices. Critical for debugging and context recall across sessions.
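A journal like this is cheap to build. A minimal sketch, assuming one JSONL file per session (the schema and function names here are ours, not any surveyed product's):

```python
import json, tempfile, time
from pathlib import Path

def append_event(journal: Path, role: str, content: str) -> None:
    """Append one timestamped event to a JSONL session journal."""
    event = {"ts": time.time(), "role": role, "content": content}
    with journal.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def read_events(journal: Path) -> list[dict]:
    """Replay the journal in chronological (append) order."""
    if not journal.exists():
        return []
    with journal.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a throwaway directory; a real agent would use a sessions folder.
journal = Path(tempfile.mkdtemp()) / "session-2025-03-07.jsonl"
append_event(journal, "user", "how do I run the tests?")
append_event(journal, "assistant", "use `cargo test`")
```

Append-only JSONL gives you crash safety for free: a half-written last line is the worst possible corruption, and the reader just skips it.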
Semantic memory is the searchable layer. Vector embeddings, full-text search indices, or both. This is where RAG lives — the agent queries for relevant knowledge rather than loading everything into the prompt.
Date-anchored memory is the least common but arguably the most underbuilt. Facts with expiration dates — your current project deadline, a temporary API key, a colleague's vacation schedule. Most products store these the same way as permanent facts, which means they never expire.
How five products implement memory
Each product makes different tradeoffs across the memory stack. Here's where they land:
Inspectability vs Searchability (0–10 scale)
The orange bars show inspectability (can you read and edit the memory?) and the blue bars show searchability (can the agent retrieve relevant memories at scale?). Claude Code and Cursor maximize human control. OpenClaw maximizes machine retrieval. ChatGPT scores low on both axes from a developer perspective — it's accessible to end users but opaque to builders.
Claude Code (Anthropic)
Claude Code takes the simplest approach in this survey: files on disk.
- CLAUDE.md files act as the primary persistent memory. One per project root, one global at ~/.claude/CLAUDE.md. Loaded into the system prompt on every session.
- Auto memory accumulates in ~/.claude/projects/<project>/memory/ — build commands, architecture notes, debugging insights, workflow preferences. Written automatically based on interaction patterns.
- Context compaction kicks in when the context window fills up. The system compresses prior messages automatically. Memory files persist across compaction boundaries.
- No RAG, no vector search. Memory is loaded directly into the prompt or read from files. Retrieval is file-path-based, not semantic.
- A growing third-party ecosystem fills the gaps: claude-mem adds semantic compression, memsearch provides markdown-first indexing, and Basic Memory offers MCP-based persistent context.
The bet here is on human readability. You can open CLAUDE.md in any text editor, see exactly what your agent knows, and change it. No database to query, no embeddings to inspect.
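The load path is easy to approximate. A sketch that walks from the working directory up to the filesystem root collecting CLAUDE.md files, plus the global one. The directory convention mirrors the description above, but the loader itself is ours, not Anthropic's:

```python
import tempfile
from pathlib import Path

def collect_memory(cwd: Path, home: Path, name: str = "CLAUDE.md") -> str:
    """Concatenate the global memory file and any project-level files
    found while walking from cwd up to the filesystem root."""
    parts = []
    global_file = home / ".claude" / name
    if global_file.exists():
        parts.append(global_file.read_text(encoding="utf-8"))
    # Nearest-last, so project-specific notes appear after global ones.
    found = [d / name for d in [cwd, *cwd.parents] if (d / name).exists()]
    parts.extend(p.read_text(encoding="utf-8") for p in reversed(found))
    return "\n\n".join(parts)

# Demo with a throwaway tree (real paths would be your project and $HOME).
root = Path(tempfile.mkdtemp())
(root / ".claude").mkdir()
(root / ".claude" / "CLAUDE.md").write_text("global: prefer concise answers")
project = root / "repo"
project.mkdir()
(project / "CLAUDE.md").write_text("project: build with make")
```

The whole retrieval layer is a directory walk, which is exactly the point: anything that can read files can read the agent's memory.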
OpenClaw
OpenClaw has the most sophisticated retrieval pipeline of the products surveyed.
- Multi-layer architecture: conversation history (working memory), long-term memory store (durable facts), and session indexing (episodic recall).
- SQLite + sqlite-vec for storage — structured queries via SQL, semantic similarity via vector embeddings, all in a single file.
- Hybrid search combines cosine similarity (semantic match) with BM25-style keyword matching. Neither method alone is sufficient — hybrid catches both conceptual and literal matches.
- Pre-compaction memory flush: before trimming the context window, the agent is given an explicit turn to extract and persist all important facts. This is the most interesting pattern in the survey — the agent itself decides what matters.
- Markdown-first philosophy for memory content, with LLM-generated session slugs for indexing (e.g., "debugging-auth-flow-march-7").
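One common way to merge keyword and vector rankings is reciprocal rank fusion; a dependency-free sketch (OpenClaw's actual weighting scheme isn't documented here, so treat the scoring as illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists of doc ids.
    A doc ranked highly by either retriever surfaces near the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["auth-bug", "login-flow", "db-schema"]      # BM25-style list
vector_hits = ["login-flow", "session-notes", "auth-bug"]   # cosine list
merged = rrf([keyword_hits, vector_hits])
# Docs both retrievers agree on ("login-flow", "auth-bug") rise to the top.
```

Rank fusion sidesteps the awkward problem of normalizing BM25 scores against cosine similarities, since only positions matter.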
The pre-compaction flush is worth highlighting. Most systems lose information silently when compaction happens. OpenClaw turns compaction into an explicit memory-formation event.
ChatGPT (OpenAI)
ChatGPT's memory is the most user-facing and the least transparent.
- User-controlled: you tell ChatGPT to "remember this" and it does. It also infers memories automatically from conversations.
- Proprietary backend — no public documentation on storage format, compaction strategy, or retrieval mechanism.
- Users can delete individual memories or clear all. A "Temporary Chat" mode disables memory entirely.
- Tiered persistence: Plus and Pro users get longer-term memory. Free users get lightweight short-term continuity.
The accessibility is unmatched — non-technical users can manage memory through a simple UI. But there's no programmatic access, no way to inspect the storage layer, and no portability.
Cursor IDE
Cursor treats memory as configuration, not knowledge.
- .cursorrules (now deprecated) was a plaintext file in the project root providing persistent instructions — essentially a system prompt extension.
- The replacement, .cursor/rules/, is a directory of rule files with more granular control.
- The community-driven Memory Bank pattern pushes this further: hierarchical rule loading organized by development phase (analysis, planning, creative, implementation). Only rules relevant to the current phase are loaded.
- No embeddings, no search, no learned facts. Rules are static instructions written by the developer.
The Memory Bank pattern is telling. Users built an elaborate multi-phase memory system on top of a tool that only supports flat config files. The demand for real memory far exceeds what's offered.
Windsurf / Codeium
Windsurf adds automatic memory generation on top of manual rules.
- The Cascade agent auto-generates memories in ~/.codeium/windsurf/memories/, capturing coding patterns and project context.
- Memories are workspace-scoped — knowledge from one project doesn't bleed into another. Reasonable for code agents, but means nothing transfers.
- Can infer agent configuration from AGENTS.md files.
- Enterprise tier adds system-level rules that admins deploy org-wide.
The workspace scoping is a deliberate tradeoff. It prevents context pollution between projects but also prevents learning that should transfer (your preferred test framework, your naming conventions, your error-handling patterns).
Feature coverage across products
Which memory roles does each product actually implement? The radar chart below scores each product across all six memory roles.
Memory Role Coverage by Product (0–10 scale)
OpenClaw dominates episodic and semantic memory — its hybrid search pipeline covers the most ground. Claude Code has the strongest agent profile support but almost no semantic recall. ChatGPT leads on user profiles but scores low on everything developers care about. Cursor is a flat line — strong on agent profile, near-zero on everything else.
The scatter chart shows the same data from a different angle — how many memory roles each product covers (x-axis) vs. how dynamically it learns (y-axis):
Memory Capability Coverage
Storage formats: markdown, SQLite, or vectors?
The storage format determines everything downstream — what you can query, what you can inspect, and what happens when things go wrong.
| Product | Storage | Search | Compaction |
|---|---|---|---|
| Claude Code | Markdown files | File path | Context window auto-compaction |
| OpenClaw | SQLite + sqlite-vec | Hybrid (cosine + BM25) | Pre-compaction flush |
| ChatGPT | Proprietary | Unknown | Unknown |
| Cursor | Text / Markdown | None | Phase-based pruning |
| Windsurf | Local files | None | Workspace isolation |
| Mem0 (infra) | DB-agnostic | Pluggable | Multi-stage extraction |
Markdown files (Claude Code, Cursor, Windsurf) are human-readable, git-friendly, and require zero dependencies. You can cat your agent's memory, edit it with vim, and commit it alongside your code. But there's no semantic search — you're limited to what fits in the context window.
SQLite + vectors (OpenClaw) gives you structured queries, full-text search via FTS5, and semantic similarity via embeddings. The cost is opacity — you need tooling to inspect memories, and the embedding model becomes a dependency.
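sqlite-vec handles the vector side natively; here is a dependency-free approximation using only the stdlib sqlite3 module, with toy two-dimensional embeddings stored as JSON and cosine similarity computed in Python:

```python
import json, math, sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")

def remember(text: str, embedding: list[float]) -> None:
    """Store a memory with its embedding serialized as JSON."""
    db.execute("INSERT INTO memories (text, embedding) VALUES (?, ?)",
               (text, json.dumps(embedding)))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recall(query_emb: list[float], top_k: int = 2) -> list[str]:
    """Brute-force nearest-neighbor scan; sqlite-vec would do this in SQL."""
    rows = db.execute("SELECT text, embedding FROM memories").fetchall()
    ranked = sorted(rows, key=lambda r: cosine(query_emb, json.loads(r[1])),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Toy embeddings; a real system would use a sentence-embedding model.
remember("build: cargo build --release", [1.0, 0.0])
remember("tests: cargo test", [0.9, 0.1])
remember("deploy via GitHub Actions", [0.0, 1.0])
```

The single-file property is what makes this attractive for agents: the entire memory, structured and semantic, travels as one SQLite file.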
Proprietary backends (ChatGPT) scale in the cloud and abstract away storage entirely. But your memories aren't portable, inspectable, or version-controllable.
The fundamental tradeoff is inspectability vs. searchability.
Markdown is maximally inspectable but unsearchable at scale. Vector databases are maximally searchable but opaque. The products developers trust most — Claude Code, OpenClaw — choose inspectable formats and layer search on top, rather than starting with an opaque database.
Compaction: what happens when the context window fills up
Every agent eventually runs out of context space. What happens next defines the quality of long-running interactions.
Naive truncation drops the oldest messages. Simple, but destructive — it loses critical early context like system prompts and initial instructions. Most products have moved past this.
KV cache compaction works at the inference layer. Recent research demonstrates 50x context reduction with minimal quality loss by compressing key-value attention caches mathematically. This is transparent to the application — the model sees a compressed but semantically equivalent context.
Hierarchical summarization mirrors human memory: working memory overflows into episodic logs (timestamped transcripts), which are periodically summarized into semantic memory (searchable facts). The pipeline runs context window → episodic log → semantic store, each stage more compressed and more durable than the last.
Anchored iterative summarization avoids reprocessing the entire history on every compaction. Only new message spans are summarized and merged with existing summaries. This is cheaper and avoids the progressive degradation that comes from summarizing summaries.
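The anchor can be as simple as an index into the message list. In this sketch, summarize() stands in for an LLM call, and the merge is plain concatenation (a real system would merge with another model call):

```python
def summarize(messages: list[str]) -> str:
    """Stand-in for an LLM summarization call."""
    return " / ".join(m[:30] for m in messages)

def compact(summary: str, anchor: int, messages: list[str]) -> tuple[str, int]:
    """Fold only messages past the anchor into the running summary,
    so old text is never re-summarized (avoids summary-of-summary decay)."""
    new_span = messages[anchor:]
    if not new_span:
        return summary, anchor
    delta = summarize(new_span)
    merged = f"{summary}\n{delta}".strip() if summary else delta
    return merged, len(messages)
```

Each compaction pays only for the new span, and the anchor guarantees no message is summarized twice.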
Episode pagination segments conversations at natural cognitive boundaries — topic shifts, tool-use completions, user-initiated breaks. Each episode becomes an independently retrievable unit, which dramatically improves recall precision compared to arbitrary chunking.
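A sketch of the segmentation loop; the boundary predicate is the hard part, and the one below (tool-call completion or an explicit topic flag) is deliberately crude:

```python
def paginate(messages: list[dict]) -> list[list[dict]]:
    """Split a transcript into episodes at boundary messages."""
    def is_boundary(msg: dict) -> bool:
        # Crude stand-ins for real topic-shift detection.
        return msg.get("type") == "tool_result" or msg.get("new_topic", False)

    episodes, current = [], []
    for msg in messages:
        current.append(msg)
        if is_boundary(msg):
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)
    return episodes
```

Each returned episode can then be indexed and retrieved as a unit, which is what gives the recall-precision win over fixed-size chunks.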
Pre-compaction flush is the most elegant pattern we found. Before trimming the context window, the agent gets an explicit turn to extract and persist all important facts. The agent itself decides what matters — not a heuristic, not a fixed window. OpenClaw implements this, and it's the pattern we're most interested in adopting.
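The flush is a small control-flow change rather than new infrastructure. In this sketch, extract_facts() stands in for giving the agent an explicit extraction turn; the function names and the FACT: convention are ours:

```python
def extract_facts(messages: list[str]) -> list[str]:
    """Stand-in for an LLM turn: 'list the facts worth keeping'."""
    return [m for m in messages if m.startswith("FACT:")]

def compact_with_flush(messages: list[str], store: list[str], keep: int) -> list[str]:
    """Before trimming, let the agent persist what matters.
    Naive trimming would silently drop everything outside `keep`."""
    doomed = messages[:-keep] if keep < len(messages) else []
    store.extend(extract_facts(doomed))  # flush runs before the trim
    return messages[-keep:]
```

The key ordering property: extraction sees exactly the messages about to be discarded, so nothing leaves the context window without a chance to be remembered.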
Research from Mem0 shows that smart compaction isn't just about saving tokens — it improves reasoning. Their benchmarks report 5-11% improvements in reasoning tasks and 91% p95 latency reduction compared to full-context baselines. Compacting intelligently is better than throwing everything into the prompt.
Patterns worth stealing
Five patterns emerged from this survey that we think every agent memory system should consider.
Memory as a hook, not a hardcoded subsystem. OpenClaw implements memory through extensible interfaces rather than baking storage decisions into the core. This lets users swap backends without changing agent logic.
Dual-store architecture. Keep a fast, inspectable format (markdown, TOML) for agent profiles and user preferences. Use a searchable store (SQLite + FTS, vectors) for episodic and semantic memory. Don't force everything into one format.
Pre-compaction flush. Before trimming context, give the agent an explicit turn to extract and persist important facts. This turns context compaction from a lossy operation into a memory-formation event.
Profile vs. recall separation. Agent profiles (always-loaded identity) and recallable knowledge (searched on demand) serve different purposes. Conflating the two — loading everything into the prompt or searching everything on demand — creates either bloated prompts or slow retrieval. The best systems separate these concerns explicitly.
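The separation shows up most clearly at prompt-assembly time: profile text is prepended unconditionally, while recalled memories are fetched per query by whatever retriever you use. A minimal sketch (the layout is illustrative):

```python
def build_prompt(profile: str, recalled: list[str], user_msg: str) -> str:
    """Profile is always loaded; recalled knowledge is query-driven."""
    memory_block = "\n".join(f"- {m}" for m in recalled)
    return (f"{profile}\n\n"
            f"Relevant memories:\n{memory_block}\n\n"
            f"User: {user_msg}")
```

Keeping the profile small and the recall list bounded (top-k) is what prevents both failure modes the pattern warns about: bloated prompts and slow retrieval.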
Human-readable by default. Every product that gained developer trust stores memory in formats humans can read and edit. Opaque databases create anxiety. Even when you add a searchable layer, the canonical format should be something you can open in a text editor.
Temporal knowledge graphs. Pure vector retrieval loses relationships and time. A graph where entities are nodes and facts are edges — with timestamps tracking when each fact was true, not just when it was stored — outperforms flat RAG on temporal reasoning tasks. Zep's research shows 18.5% higher accuracy and ~90% lower latency compared to vector-only baselines on complex temporal queries. The key is bi-temporal tracking: separating when a fact was recorded from when it was actually true. This is how "user is on vacation until March 15" can auto-expire without manual cleanup.
Open questions
This survey raised more questions than it answered. Here are the ones we keep coming back to.
Can one storage layer do it all? Markdown is inspectable but unsearchable. Vector databases are searchable but opaque. Every product picks a side or bolts one onto the other. Is there a single storage primitive that gives you both — human-readable and semantically searchable — without the complexity of maintaining two separate systems?
Should memory be a graph? Flat key-value memories lose relationships. "Alice works on Project X" and "Project X uses Rust" are two disconnected facts in a vector store — but a graph trivially connects them. Zep's research shows 18.5% accuracy gains from graph-based retrieval on temporal queries. But graphs add complexity. Where's the crossover point where the complexity pays for itself?
Who decides what to remember? Most products use heuristics or let users explicitly say "remember this." OpenClaw's pre-compaction flush is more interesting — the agent itself decides what matters before context is trimmed. But agent-driven memory formation introduces a new failure mode: the agent might remember the wrong things, or forget the right ones. How do you evaluate memory quality?
How should memories expire? Date-anchored memory is the most underbuilt category in this survey. "User is on vacation until March 15" should auto-expire. But most systems store it identically to permanent facts. Bi-temporal tracking (separating when a fact was recorded from when it was true) solves this in theory — but no product we surveyed implements it well in practice.
Can memory transfer across agents? Cursor and Windsurf scope memory to a single workspace. Claude Code scopes to a project directory. ChatGPT scopes to a user but not to a task. None of these scoping models feel right. Your preferred test framework should follow you everywhere. Your current project's auth implementation should not.
We wrote about how we're approaching these questions in Graph + vector: how OpenWalrus agents remember. If you're building agent memory systems, we'd love to compare notes — open an issue on GitHub or find us in the discussions.