

Mem0: what three memory scopes actually cost

We examined Mem0's extraction pipeline, conflict resolution, and benchmark claims. Smart memory management is real — but most agents don't need three scopes.

research · OpenWalrus Team

Every agent memory system eventually faces the same question: when should the agent forget? Mem0's answer is to never let it come to that — an LLM-powered extraction pipeline watches every conversation, pulls out candidate memories, deduplicates them against a vector store, and asks a second LLM to decide whether each one should be added, updated, deleted, or ignored. It's the most sophisticated memory management pipeline we've examined. It's also the most expensive.

We dug into how Mem0 actually works: the extraction pipeline, the three memory scopes, the benchmark claims, and the infrastructure required to run it. Here's what we found.

The extraction pipeline

Most agent memory systems store what the agent explicitly asks to store. Mem0 takes a different approach: it watches every conversation and automatically extracts memories the agent never asked for.

How memories get created

Three inputs feed the extraction pipeline:

  1. Latest exchange — the most recent user message and agent response
  2. Rolling summary — a compressed summary of recent conversation context
  3. Recent messages — the last m messages for continuity

An LLM processes these inputs and extracts candidate memories — concise facts, not full text. "User prefers TypeScript" rather than the full conversation where they mentioned it.
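The three inputs above can be sketched as a single prompt-assembly step. This is our own illustrative code, not Mem0's actual API; the function and variable names (`build_extraction_prompt`, `rolling_summary`, etc.) are assumptions.

```python
# Hypothetical sketch of Mem0-style extraction input assembly.
# All names here are ours, not Mem0's real interface.

def build_extraction_prompt(latest_exchange, rolling_summary, recent_messages, m=10):
    """Assemble the three inputs the extraction LLM sees."""
    context = recent_messages[-m:]  # last m messages for continuity
    lines = [
        "Extract concise candidate memories (facts, not transcripts).",
        f"Summary so far: {rolling_summary}",
        "Recent messages:",
        *[f"  {role}: {text}" for role, text in context],
        f"Latest user message: {latest_exchange['user']}",
        f"Latest agent response: {latest_exchange['agent']}",
    ]
    return "\n".join(lines)

prompt = build_extraction_prompt(
    {"user": "Use TypeScript for the new service.", "agent": "Done."},
    "User is setting up a new backend service.",
    [("user", "We deploy on Fly.io"), ("agent", "Noted.")],
)
```

The output of this call goes to the extraction LLM, which returns compact facts like "User prefers TypeScript" rather than transcript text.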

The four-way LLM decision

For each candidate memory, a second LLM call runs:

  1. Vector similarity search retrieves existing memories similar to the candidate
  2. The LLM sees the candidate and its nearest neighbors and decides one of four actions:
    • ADD — genuinely new information, store it
    • UPDATE — augment an existing memory with more recent or detailed info
    • DELETE — the new information contradicts an existing memory, remove the old one
    • NOOP — the fact already exists or is irrelevant, skip it

This is where the cost lives. Every memory write requires two LLM calls (extract + decide), plus a vector similarity search. Over a 100-turn conversation, that's 200+ LLM calls just for memory management.
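The decision step can be sketched as follows. In production the choice is made by the second LLM call; the rule-based fallback below is our own stand-in, and none of these names are Mem0's real API.

```python
# Minimal sketch of the four-way ADD / UPDATE / DELETE / NOOP decision.
# `llm_decide` stands in for the second LLM call; the fallback rules are ours.

from dataclasses import dataclass

@dataclass
class Memory:
    id: int
    text: str

def decide(candidate: str, neighbors: list[Memory], llm_decide=None):
    """Return one of ADD / UPDATE / DELETE / NOOP plus the memory it targets."""
    if llm_decide is not None:        # production path: second LLM call
        return llm_decide(candidate, neighbors)
    if not neighbors:                 # nothing similar retrieved: genuinely new
        return ("ADD", None)
    nearest = neighbors[0]
    if candidate == nearest.text:     # exact duplicate already stored
        return ("NOOP", nearest)
    return ("UPDATE", nearest)        # crude stand-in for the LLM's judgment

assert decide("User prefers TypeScript", []) == ("ADD", None)
```

Note the cost structure is visible even in the sketch: every write needs the neighbor retrieval plus a decision, on top of the extraction call that produced the candidate.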

Graph-based conflict resolution

Mem0's graph variant (Mem0ᵍ) adds a layer on top: a Conflict Detector that flags overlapping or contradictory nodes and edges, and an Update Resolver that determines merges, invalidations, or skips. This supports temporal reasoning — marking relationships as obsolete without deleting them.
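"Marking relationships as obsolete without deleting them" is typically done with validity timestamps. A minimal sketch, assuming a simple edge schema of our own invention (the field names are not Mem0ᵍ's actual schema):

```python
# Sketch of temporal invalidation: an edge keeps its history via
# valid_from / valid_to timestamps instead of being deleted.

from datetime import datetime, timezone

class Edge:
    def __init__(self, src, rel, dst):
        self.src, self.rel, self.dst = src, rel, dst
        self.valid_from = datetime.now(timezone.utc)
        self.valid_to = None  # None = still current

    def invalidate(self):
        self.valid_to = datetime.now(timezone.utc)

    @property
    def current(self):
        return self.valid_to is None

def resolve_conflict(old: Edge, new: Edge):
    """Update Resolver sketch: a newer contradicting edge supersedes the old one."""
    if old.src == new.src and old.rel == new.rel and old.dst != new.dst:
        old.invalidate()  # kept for temporal reasoning, no longer current

e1 = Edge("user", "prefers_language", "Python")
e2 = Edge("user", "prefers_language", "TypeScript")
resolve_conflict(e1, e2)
```

After resolution, `e1` is still queryable ("the user used to prefer Python") while `e2` is the current fact.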

The pipeline is technically impressive. The question is whether the overhead is worth it for most agent use cases.

Three memory scopes

Mem0 organizes memory into three scopes that map to different temporal horizons.

Conversation memory (short-term)

In-flight messages within a single turn. What was just said. This is what every agent framework has — the context window itself.

Session memory

Short-lived context within a single task or channel. Tool outputs, intermediate calculations, what the agent is currently focused on. Dies when the session ends.

User memory (long-term)

Persists across all conversations with a specific user. This is the most interesting scope — it contains:

  • Factual memory: preferences, account details, domain knowledge
  • Episodic memory: summaries of past interactions
  • Semantic memory: relationships between concepts for reasoning

The system stores each scope separately and merges them during query. The search pipeline pulls from all scopes, ranking user memories first, then session notes, then raw history.
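The merge-and-rank step can be sketched like this. The scoring is a naive substring match standing in for vector similarity, and the ranking rule is the one described above; everything else is our own assumption.

```python
# Sketch of scope-ranked retrieval: gather matches from each scope,
# then order user > session > conversation (raw history).

SCOPE_RANK = {"user": 0, "session": 1, "conversation": 2}

def search(query: str, stores: dict[str, list[str]], k: int = 5):
    hits = [
        (scope, mem)
        for scope, memories in stores.items()
        for mem in memories
        if query.lower() in mem.lower()   # toy stand-in for vector search
    ]
    hits.sort(key=lambda h: SCOPE_RANK[h[0]])
    return hits[:k]

stores = {
    "conversation": ["user asked about TypeScript generics"],
    "session": ["current task: TypeScript migration"],
    "user": ["user prefers TypeScript"],
}
results = search("typescript", stores)
```

Here `results[0]` comes from the user scope, matching the ranking the post describes.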

The scope assignment problem: when the extraction pipeline identifies a new memory, which scope does it belong to? "User prefers TypeScript" is clearly user-scoped. "The current deployment is failing" is session-scoped. But "user is working on a migration to Rust" sits in a gray zone — it's user-level context, but it's temporary. Misclassification in either direction causes problems: user-scoped memories that should be session-scoped pollute all future sessions; session-scoped memories that should be user-scoped disappear when the session ends.

The benchmark claims

Mem0's research paper (Chhikara et al., April 2025) reports strong numbers.

LOCOMO results

On the LOCOMO (Long Conversation Memory) benchmark, Mem0 scores 66.9% on an LLM-as-Judge evaluation, compared to 52.9% for OpenAI's memory. The graph variant (Mem0ᵍ) adds roughly 2% on top.

Token savings and latency

| Metric | Mem0 claim | Baseline | Source |
| --- | --- | --- | --- |
| Token savings | 90% reduction | Full-context (26K → 1.8K tokens) | arXiv:2504.19413 |
| Latency (P95) | 91% reduction | Full-context (17.12s → 1.44s) | arXiv:2504.19413 |
| Accuracy | 26% relative improvement | LLM-as-Judge vs OpenAI memory | arXiv:2504.19413 |
| LOCOMO F1 | 66.9% | LongMemEval benchmark | arXiv:2504.19413 |

What the paper actually measures

The 90% token savings compares selective memory retrieval (pull only relevant memories) against stuffing the full conversation history into the context window. This is a real comparison, but the baseline is generous — few production systems stuff raw history without any summarization. Against a properly compacted conversation, the savings would be smaller.

The paper doesn't report the total cost including the extraction pipeline itself. The 90% savings is on the retrieval side only. If the extraction pipeline adds 200 LLM calls over a 100-turn conversation, the total cost equation changes significantly.
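A back-of-envelope version of that total cost equation, with all token counts as illustrative assumptions rather than measured figures:

```python
# Total token cost = retrieval-side tokens + extraction-pipeline tokens.
# The 26K / 1.8K per-turn figures come from the paper's comparison;
# the 500 tokens per pipeline call is our own assumption.

def total_tokens(turns, retrieval_per_turn,
                 extraction_calls_per_turn=2, extraction_tokens_per_call=500):
    retrieval = turns * retrieval_per_turn
    extraction = turns * extraction_calls_per_turn * extraction_tokens_per_call
    return retrieval + extraction

full_context = total_tokens(100, 26_000, extraction_calls_per_turn=0)
mem0_style = total_tokens(100, 1_800)  # 1.8K retrieved + 2 pipeline calls/turn

# Retrieval alone saves ~93% (1.8K vs 26K); with the pipeline included,
# the net saving under these assumptions drops to ~89%.
net_saving = 1 - mem0_style / full_context
```

The gap widens further if the baseline is a compacted history rather than raw full context.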

The practical deployments the paper cites (RevisionDojo, OpenNote) report 40% token reduction — a more realistic figure that likely includes extraction overhead.

Infrastructure requirements

Self-hosted stack

Running Mem0 yourself requires:

  1. Docker & Docker Compose v2 — orchestration layer
  2. PostgreSQL + pgvector — vector storage
  3. Neo4j — graph database for relationship memory
  4. OpenAI API key — default LLM and embedding model (swappable for Ollama for fully local inference)

That's four external services before you store a single memory. The documentation estimates 2-5 minutes for initial setup, but production deployment (persistence volumes, auth, CORS, monitoring) is significantly more involved. The default configuration has no authentication or CORS restrictions — the docs explicitly warn about needing a reverse proxy before network exposure.

Managed service

Mem0's managed service at app.mem0.ai reduces this to a single API key. SOC 2 compliant, with audit logs and workspace governance. This is where the infrastructure complexity disappears — but the LLM extraction cost remains.

[Chart: Infrastructure Requirements per Memory System]

How it compares

| | Mem0 | Walrus | Graphiti (Zep) | Claude Code |
| --- | --- | --- | --- | --- |
| Memory scopes | 3 (conversation, session, user) | 1 (unified graph) | 1 (temporal KG) | 1 (files on disk) |
| Storage backend | 24+ vector stores + Neo4j | LanceDB + lance-graph | Neo4j | Markdown files |
| Extraction | LLM pipeline (extract + decide) | Agent tools (remember/recall) | LLM + temporal edges | Manual / auto-memory |
| Conflict resolution | Graph Conflict Detector + Update Resolver | Upsert (last write wins) | Bi-temporal invalidation | Manual edit |
| External dependencies | PostgreSQL, Neo4j, vector DB, OpenAI | None (embedded) | Neo4j server | None |
| LLM calls per write | 2 (extract + decide) | 0 | 1 (extraction) | 0 |
| Deployment | Docker Compose or managed cloud | Single binary | Docker + Neo4j | CLI / IDE |
| License | Apache 2.0 | MIT | MIT | Proprietary |

[Radar chart: Memory System Capabilities (0–10 scale)]

The radar shows the core tradeoff: Mem0 dominates on deduplication and conflict resolution. Walrus dominates on setup simplicity and schema flexibility. Neither wins everywhere — they're optimizing for different constraints.

What Walrus does differently

Walrus bets on a single memory layer: LanceDB + lance-graph with three tables (entities, relations, journals) and six tools (remember, recall, relate, connections, compact, distill). No extraction pipeline, no scope disambiguation, no LLM calls per write.

The write path tells the story. Mem0 adds four steps between "something worth remembering happened" and "memory stored." Walrus has one: the agent calls remember and the fact goes into the graph.
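The two write paths side by side, as a sketch. Every function here is a stub standing in for a real call (LLM, vector search, tool invocation); none of these names are actual Mem0 or Walrus APIs.

```python
# Contrast sketch: Mem0-style write (extract -> search -> decide -> apply)
# vs a Walrus-style single-step write. All functions are illustrative stubs.

def extract(message):              # stands in for LLM call 1
    return [message.strip()]

def similar(candidate, store):     # stands in for vector similarity search
    return [m for m in store if m == candidate]

def decide(candidate, neighbors):  # stands in for LLM call 2
    return "NOOP" if neighbors else "ADD"

def mem0_write(message, store):    # four steps before anything is stored
    for c in extract(message):
        if decide(c, similar(c, store)) == "ADD":
            store.append(c)

def walrus_write(fact, store):     # one step: the agent calls `remember`
    store.append(fact)

store_a, store_b = [], []
mem0_write("User prefers TypeScript", store_a)
mem0_write("User prefers TypeScript", store_a)  # second write: NOOP, no duplicate
walrus_write("User prefers TypeScript", store_b)
```

The stub version also shows what the pipeline buys: the repeated `mem0_write` is deduplicated automatically, while the Walrus path leaves that judgment to the agent.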

Where this works: for agents that run tens to hundreds of sessions, the agent itself can manage deduplication through careful key naming and recall before remember. The LLM is already reasoning about the conversation — asking it to also decide what's worth storing is a smaller cognitive burden than running a separate extraction pipeline.

Where this breaks: at thousands of sessions with the same user, manual deduplication stops scaling. If the agent uses different keys for the same concept across sessions, duplicates accumulate. Mem0's similarity-threshold deduplication (0.85 cosine similarity triggers a semantic merge) catches these automatically. Walrus doesn't — yet.
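The 0.85-threshold check is straightforward to sketch. The threshold comes from the post; the bag-of-words `embed` below is a toy stand-in for a real embedding model, and the function names are ours.

```python
# Sketch of similarity-threshold dedup: merge when cosine similarity of
# the two memories' embeddings reaches 0.85.

import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system uses a learned model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def should_merge(new: str, existing: str, threshold: float = 0.85) -> bool:
    return cosine(embed(new), embed(existing)) >= threshold
```

With a real embedding model, near-paraphrases ("likes TS", "prefers TypeScript") land above the threshold even without shared tokens, which is exactly the case key-based dedup misses.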

We explored these memory architecture tradeoffs across five products in persistent agent memory research. Hermes Agent takes yet another approach with five memory layers — procedural skills, user modeling via Honcho, and FTS5 for cross-session recall. The context compaction survey covers how frameworks handle the overflow problem that drives memory systems in the first place.

[Chart: Per-Operation Cost Profile (relative scale 0–10)]

Open questions

  1. Does the extraction pipeline pay for itself? Mem0 makes 2 LLM calls per memory write. At GPT-4o pricing, a 100-turn conversation costs roughly $0.30–0.80 just in memory management. The 90% token savings on retrieval are real — but do they offset the extraction cost? The paper reports savings on the retrieval side only, not total cost including extraction.

  2. What happens when the conflict resolver gets it wrong? The graph-based Conflict Detector + Update Resolver is LLM-powered, which means probabilistic. If it incorrectly marks "prefers async/await in TypeScript" as conflicting with "prefers callbacks in Python" (different languages, different contexts), the user loses a valid memory. The paper reports aggregate accuracy but not conflict resolution precision.

  3. Do most agents need three memory scopes? Conversation, session, and user memory is a clean taxonomy. But scope assignment is itself an LLM decision — misclassification creates problems in both directions. For many agent use cases (coding assistants, chatbots, task automation), a single-layer approach with explicit agent control may be simpler and sufficient.

  4. Can a single-layer approach match Mem0 at scale? At 10,000 memories across 500 sessions, deduplication isn't optional — it's survival. Does Walrus need to add dedup at the storage layer, or can smarter recall + remember patterns handle it?

  5. Is the managed service the real product? Self-hosted Mem0 requires Docker + PostgreSQL + Neo4j + OpenAI. The managed service requires an API key. The complexity gap between the two is enormous. The open-source version may be more lead generator than standalone product — a pattern increasingly common in AI infrastructure.

Further reading