

Mem0: what three memory scopes actually cost

We examined Mem0's extraction pipeline, conflict resolution, and benchmark claims. Smart memory management is real — but most agents don't need three scopes.

research · OpenWalrus Team

Every agent memory system eventually faces the same question: when should the agent forget? Mem0's answer is to never let it come to that — an LLM-powered extraction pipeline watches every conversation, pulls out candidate memories, deduplicates them against a vector store, and asks a second LLM to decide whether each one should be added, updated, deleted, or ignored. It's the most sophisticated memory management pipeline we've examined. It's also the most expensive.

We dug into how Mem0 actually works: the extraction pipeline, the three memory scopes, the benchmark claims, and the infrastructure required to run it. Here's what we found.

The extraction pipeline

Most agent memory systems store what the agent explicitly asks to store. Mem0 takes a different approach: it watches every conversation and automatically extracts memories the agent never asked for.

How memories get created

Three inputs feed the extraction pipeline:

  1. Latest exchange — the most recent user message and agent response
  2. Rolling summary — a compressed summary of recent conversation context
  3. Recent messages — the last m messages for continuity

An LLM processes these inputs and extracts candidate memories — concise facts, not full text. "User prefers TypeScript" rather than the full conversation where they mentioned it.
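The three inputs above can be sketched as a single prompt-assembly step. This is our own illustrative code, not Mem0's actual API; the function and variable names (`build_extraction_prompt`, `rolling_summary`, etc.) are assumptions.

```python
# Hypothetical sketch of Mem0-style extraction input assembly.
# All names here are ours, not Mem0's real interface.

def build_extraction_prompt(latest_exchange, rolling_summary, recent_messages, m=10):
    """Assemble the three inputs the extraction LLM sees."""
    context = recent_messages[-m:]  # last m messages for continuity
    lines = [
        "Extract concise candidate memories (facts, not transcripts).",
        f"Summary so far: {rolling_summary}",
        "Recent messages:",
        *[f"  {role}: {text}" for role, text in context],
        f"Latest user message: {latest_exchange['user']}",
        f"Latest agent response: {latest_exchange['agent']}",
    ]
    return "\n".join(lines)

prompt = build_extraction_prompt(
    {"user": "Use TypeScript for the new service.", "agent": "Done."},
    "User is setting up a new backend service.",
    [("user", "We deploy on Fly.io"), ("agent", "Noted.")],
)
```

The output of this call goes to the extraction LLM, which returns compact facts like "User prefers TypeScript" rather than transcript text.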

The four-way LLM decision

For each candidate memory, a second LLM call runs:

  1. Vector similarity search retrieves existing memories similar to the candidate
  2. The LLM sees the candidate and its nearest neighbors and decides one of four actions:
    • ADD — genuinely new information, store it
    • UPDATE — augment an existing memory with more recent or detailed info
    • DELETE — the new information contradicts an existing memory, remove the old one
    • NOOP — the fact already exists or is irrelevant, skip it

This is where the cost lives. Every memory write requires two LLM calls (extract + decide), plus a vector similarity search. Over a 100-turn conversation, that's 200+ LLM calls just for memory management.
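The decision step can be sketched as follows. In production the choice is made by the second LLM call; the rule-based fallback below is our own stand-in, and none of these names are Mem0's real API.

```python
# Minimal sketch of the four-way ADD / UPDATE / DELETE / NOOP decision.
# `llm_decide` stands in for the second LLM call; the fallback rules are ours.

from dataclasses import dataclass

@dataclass
class Memory:
    id: int
    text: str

def decide(candidate: str, neighbors: list[Memory], llm_decide=None):
    """Return one of ADD / UPDATE / DELETE / NOOP plus the memory it targets."""
    if llm_decide is not None:        # production path: second LLM call
        return llm_decide(candidate, neighbors)
    if not neighbors:                 # nothing similar retrieved: genuinely new
        return ("ADD", None)
    nearest = neighbors[0]
    if candidate == nearest.text:     # exact duplicate already stored
        return ("NOOP", nearest)
    return ("UPDATE", nearest)        # crude stand-in for the LLM's judgment

assert decide("User prefers TypeScript", []) == ("ADD", None)
```

Note the cost structure is visible even in the sketch: every write needs the neighbor retrieval plus a decision, on top of the extraction call that produced the candidate.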

Graph-based conflict resolution

Mem0's graph variant (Mem0ᵍ) adds a layer on top: a Conflict Detector that flags overlapping or contradictory nodes and edges, and an Update Resolver that determines merges, invalidations, or skips. This supports temporal reasoning — marking relationships as obsolete without deleting them.
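"Marking relationships as obsolete without deleting them" is typically done with validity timestamps. A minimal sketch, assuming a simple edge schema of our own invention (the field names are not Mem0ᵍ's actual schema):

```python
# Sketch of temporal invalidation: an edge keeps its history via
# valid_from / valid_to timestamps instead of being deleted.

from datetime import datetime, timezone

class Edge:
    def __init__(self, src, rel, dst):
        self.src, self.rel, self.dst = src, rel, dst
        self.valid_from = datetime.now(timezone.utc)
        self.valid_to = None  # None = still current

    def invalidate(self):
        self.valid_to = datetime.now(timezone.utc)

    @property
    def current(self):
        return self.valid_to is None

def resolve_conflict(old: Edge, new: Edge):
    """Update Resolver sketch: a newer contradicting edge supersedes the old one."""
    if old.src == new.src and old.rel == new.rel and old.dst != new.dst:
        old.invalidate()  # kept for temporal reasoning, no longer current

e1 = Edge("user", "prefers_language", "Python")
e2 = Edge("user", "prefers_language", "TypeScript")
resolve_conflict(e1, e2)
```

After resolution, `e1` is still queryable ("the user used to prefer Python") while `e2` is the current fact.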

The pipeline is technically impressive. The question is whether the overhead is worth it for most agent use cases.

Three memory scopes

Mem0 organizes memory into three scopes that map to different temporal horizons.

Conversation memory (short-term)

In-flight messages within a single turn. What was just said. This is what every agent framework has — the context window itself.

Session memory

Short-lived context within a single task or channel. Tool outputs, intermediate calculations, what the agent is currently focused on. Dies when the session ends.

User memory (long-term)

Persists across all conversations with a specific user. This is the most interesting scope — it contains:

  • Factual memory: preferences, account details, domain knowledge
  • Episodic memory: summaries of past interactions
  • Semantic memory: relationships between concepts for reasoning

The system stores each scope separately and merges them during query. The search pipeline pulls from all scopes, ranking user memories first, then session notes, then raw history.
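The merge-and-rank step can be sketched like this. The scoring is a naive substring match standing in for vector similarity, and the ranking rule is the one described above; everything else is our own assumption.

```python
# Sketch of scope-ranked retrieval: gather matches from each scope,
# then order user > session > conversation (raw history).

SCOPE_RANK = {"user": 0, "session": 1, "conversation": 2}

def search(query: str, stores: dict[str, list[str]], k: int = 5):
    hits = [
        (scope, mem)
        for scope, memories in stores.items()
        for mem in memories
        if query.lower() in mem.lower()   # toy stand-in for vector search
    ]
    hits.sort(key=lambda h: SCOPE_RANK[h[0]])
    return hits[:k]

stores = {
    "conversation": ["user asked about TypeScript generics"],
    "session": ["current task: TypeScript migration"],
    "user": ["user prefers TypeScript"],
}
results = search("typescript", stores)
```

Here `results[0]` comes from the user scope, matching the ranking the post describes.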

The scope assignment problem: when the extraction pipeline identifies a new memory, which scope does it belong to? "User prefers TypeScript" is clearly user-scoped. "The current deployment is failing" is session-scoped. But "user is working on a migration to Rust" sits in a gray zone — it's user-level context, but it's temporary. Misclassification in either direction causes problems: user-scoped memories that should be session-scoped pollute all future sessions; session-scoped memories that should be user-scoped disappear when the session ends.

The benchmark claims

Mem0's research paper (Chhikara et al., April 2025) reports strong numbers.

LOCOMO results

On the LOCOMO (Long Conversation Memory) benchmark, Mem0 scores 66.9% on an LLM-as-Judge evaluation, compared to 52.9% for OpenAI's memory. The graph variant (Mem0ᵍ) adds roughly 2% on top.

Token savings and latency

| Metric | Mem0 claim | Baseline | Source |
| --- | --- | --- | --- |
| Token savings | 90% reduction | Full-context (26K → 1.8K tokens) | arXiv:2504.19413 |
| Latency (P95) | 91% reduction | Full-context (17.12s → 1.44s) | arXiv:2504.19413 |
| Accuracy | 26% relative improvement | LLM-as-Judge vs OpenAI memory | arXiv:2504.19413 |
| LOCOMO F1 | 66.9% | LongMemEval benchmark | arXiv:2504.19413 |

What the paper actually measures

The 90% token savings compares selective memory retrieval (pull only relevant memories) against stuffing the full conversation history into the context window. This is a real comparison, but the baseline is generous — few production systems stuff raw history without any summarization. Against a properly compacted conversation, the savings would be smaller.

The paper doesn't report the total cost including the extraction pipeline itself. The 90% savings is on the retrieval side only. If the extraction pipeline adds 200 LLM calls over a 100-turn conversation, the total cost equation changes significantly.
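A back-of-envelope version of that total cost equation, with all token counts as illustrative assumptions rather than measured figures:

```python
# Total token cost = retrieval-side tokens + extraction-pipeline tokens.
# The 26K / 1.8K per-turn figures come from the paper's comparison;
# the 500 tokens per pipeline call is our own assumption.

def total_tokens(turns, retrieval_per_turn,
                 extraction_calls_per_turn=2, extraction_tokens_per_call=500):
    retrieval = turns * retrieval_per_turn
    extraction = turns * extraction_calls_per_turn * extraction_tokens_per_call
    return retrieval + extraction

full_context = total_tokens(100, 26_000, extraction_calls_per_turn=0)
mem0_style = total_tokens(100, 1_800)  # 1.8K retrieved + 2 pipeline calls/turn

# Retrieval alone saves ~93% (1.8K vs 26K); with the pipeline included,
# the net saving under these assumptions drops to ~89%.
net_saving = 1 - mem0_style / full_context
```

The gap widens further if the baseline is a compacted history rather than raw full context.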

The practical deployments the paper cites (RevisionDojo, OpenNote) report 40% token reduction — a more realistic figure that likely includes extraction overhead.

Infrastructure requirements

Self-hosted stack

Running Mem0 yourself requires:

  1. Docker & Docker Compose v2 — orchestration layer
  2. PostgreSQL + pgvector — vector storage
  3. Neo4j — graph database for relationship memory
  4. OpenAI API key — default LLM and embedding model (swappable for Ollama for fully local inference)

That's four external services before you store a single memory. The documentation estimates 2-5 minutes for initial setup, but production deployment (persistence volumes, auth, CORS, monitoring) is significantly more involved. The default configuration has no authentication or CORS restrictions — the docs explicitly warn about needing a reverse proxy before network exposure.

Managed service

Mem0's managed service at app.mem0.ai reduces this to a single API key. SOC 2 compliant, with audit logs and workspace governance. This is where the infrastructure complexity disappears — but the LLM extraction cost remains.

[Chart: Infrastructure Requirements per Memory System]

How it compares

| | Mem0 | Walrus | Graphiti (Zep) | Claude Code |
| --- | --- | --- | --- | --- |
| Memory scopes | 3 (conversation, session, user) | 1 (unified graph) | 1 (temporal KG) | 1 (files on disk) |
| Storage backend | 24+ vector stores + Neo4j | LanceDB + lance-graph | Neo4j | Markdown files |
| Extraction | LLM pipeline (extract + decide) | Agent tools (remember/recall) | LLM + temporal edges | Manual / auto-memory |
| Conflict resolution | Graph Conflict Detector + Update Resolver | Upsert (last write wins) | Bi-temporal invalidation | Manual edit |
| External dependencies | PostgreSQL, Neo4j, vector DB, OpenAI | None (embedded) | Neo4j server | None |
| LLM calls per write | 2 (extract + decide) | 0 | 1 (extraction) | 0 |
| Deployment | Docker Compose or managed cloud | Single binary | Docker + Neo4j | CLI / IDE |
| License | Apache 2.0 | MIT | MIT | Proprietary |

[Radar chart: Memory System Capabilities (0–10 scale)]

The radar shows the core tradeoff: Mem0 dominates on deduplication and conflict resolution. Walrus dominates on setup simplicity and schema flexibility. Neither wins everywhere — they're optimizing for different constraints.

What Walrus does differently

Walrus bets on a single memory layer: LanceDB + lance-graph with three tables (entities, relations, journals) and six tools (remember, recall, relate, connections, compact, distill). No extraction pipeline, no scope disambiguation, no LLM calls per write.

The write path tells the story. Mem0 adds four steps between "something worth remembering happened" and "memory stored." Walrus has one: the agent calls remember and the fact goes into the graph.
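The two write paths side by side, as a sketch. Every function here is a stub standing in for a real call (LLM, vector search, tool invocation); none of these names are actual Mem0 or Walrus APIs.

```python
# Contrast sketch: Mem0-style write (extract -> search -> decide -> apply)
# vs a Walrus-style single-step write. All functions are illustrative stubs.

def extract(message):              # stands in for LLM call 1
    return [message.strip()]

def similar(candidate, store):     # stands in for vector similarity search
    return [m for m in store if m == candidate]

def decide(candidate, neighbors):  # stands in for LLM call 2
    return "NOOP" if neighbors else "ADD"

def mem0_write(message, store):    # four steps before anything is stored
    for c in extract(message):
        if decide(c, similar(c, store)) == "ADD":
            store.append(c)

def walrus_write(fact, store):     # one step: the agent calls `remember`
    store.append(fact)

store_a, store_b = [], []
mem0_write("User prefers TypeScript", store_a)
mem0_write("User prefers TypeScript", store_a)  # second write: NOOP, no duplicate
walrus_write("User prefers TypeScript", store_b)
```

The stub version also shows what the pipeline buys: the repeated `mem0_write` is deduplicated automatically, while the Walrus path leaves that judgment to the agent.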

Where this works: for agents that run tens to hundreds of sessions, the agent itself can manage deduplication through careful key naming and recall before remember. The LLM is already reasoning about the conversation — asking it to also decide what's worth storing is a smaller cognitive burden than running a separate extraction pipeline.

Where this breaks: at thousands of sessions with the same user, manual deduplication stops scaling. If the agent uses different keys for the same concept across sessions, duplicates accumulate. Mem0's similarity-threshold deduplication (0.85 cosine similarity triggers a semantic merge) catches these automatically. Walrus doesn't — yet.
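The 0.85-threshold check is straightforward to sketch. The threshold comes from the post; the bag-of-words `embed` below is a toy stand-in for a real embedding model, and the function names are ours.

```python
# Sketch of similarity-threshold dedup: merge when cosine similarity of
# the two memories' embeddings reaches 0.85.

import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system uses a learned model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def should_merge(new: str, existing: str, threshold: float = 0.85) -> bool:
    return cosine(embed(new), embed(existing)) >= threshold
```

With a real embedding model, near-paraphrases ("likes TS", "prefers TypeScript") land above the threshold even without shared tokens, which is exactly the case key-based dedup misses.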

We explored these memory architecture tradeoffs across five products in persistent agent memory research. Hermes Agent takes yet another approach with five memory layers — procedural skills, user modeling via Honcho, and FTS5 for cross-session recall. The context compaction survey covers how frameworks handle the overflow problem that drives memory systems in the first place.

[Chart: Per-Operation Cost Profile (relative scale 0–10)]

Open questions

  1. Does the extraction pipeline pay for itself? Mem0 makes 2 LLM calls per memory write. At GPT-4o pricing, a 100-turn conversation costs roughly $0.30–0.80 just in memory management. The 90% token savings on retrieval are real — but do they offset the extraction cost? The paper reports savings on the retrieval side only, not total cost including extraction.

  2. What happens when the conflict resolver gets it wrong? The graph-based Conflict Detector + Update Resolver is LLM-powered, which means probabilistic. If it incorrectly marks "prefers async/await in TypeScript" as conflicting with "prefers callbacks in Python" (different languages, different contexts), the user loses a valid memory. The paper reports aggregate accuracy but not conflict resolution precision.

  3. Do most agents need three memory scopes? Conversation, session, and user memory is a clean taxonomy. But scope assignment is itself an LLM decision — misclassification creates problems in both directions. For many agent use cases (coding assistants, chatbots, task automation), a single-layer approach with explicit agent control may be simpler and sufficient.

  4. Can a single-layer approach match Mem0 at scale? At 10,000 memories across 500 sessions, deduplication isn't optional — it's survival. Does Walrus need to add dedup at the storage layer, or can smarter recall + remember patterns handle it?

  5. Is the managed service the real product? Self-hosted Mem0 requires Docker + PostgreSQL + Neo4j + OpenAI. The managed service requires an API key. The complexity gap between the two is enormous. The open-source version may be more lead generator than standalone product — a pattern increasingly common in AI infrastructure.

Further reading