
Hermes memory: five layers, one learning loop

We examined Hermes Agent's five-layer memory system — procedural skills, Honcho user modeling, FTS5 search — and asked what it costs to remember this much.

research · OpenWalrus Team

Hermes Agent remembers by doing. Complete a complex task, and it writes a SKILL.md — a step-by-step recipe it can retrieve next time. Ask it something personal, and Honcho derives a Theory of Mind snapshot from the conversation. Search for last week's work, and FTS5 pulls it from a SQLite index. Five memory layers, each solving a different temporal problem. No other open-source agent runtime attempts this much.

We examined Hermes Agent's memory architecture in depth — not the models or the terminal backends (we covered those in the runtime survey), but the memory system specifically. How the five layers interact, what each one costs, and what's missing.

Five layers, explained

Layer 1: Short-term inference memory

The context window. Every agent has this — it's the transformer's working memory for the current session. Hermes compresses at 50% context utilization (configurable) and caps tool orchestration at 90 iteration steps by default.

Nothing survives a restart. This layer exists to be lost.
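The compaction trigger is simple to express. A toy sketch of the utilization check (the function name and shape are ours; Hermes's actual heuristic isn't public beyond the 50% figure):

```python
def should_compress(tokens_used: int, context_size: int, threshold: float = 0.5) -> bool:
    """Return True once context utilization crosses the compaction threshold."""
    return tokens_used / context_size >= threshold

# A 128k-token window starts compressing once 64k tokens are in use.
print(should_compress(63_999, 128_000))  # False
print(should_compress(64_000, 128_000))  # True
```

The threshold being a plain parameter is the "configurable" part; everything below it is session-local and, as the text notes, exists to be lost.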

Layer 2: Procedural skill documents

This is what makes Hermes's memory unique. When the agent completes a complex task — debugging a microservice, optimizing a pipeline — it autonomously writes a SKILL.md file capturing the step-by-step solution.

The format follows the agentskills.io standard:

  • Frontmatter: name (1-64 chars, lowercase + hyphens), description (1-1024 chars), optional license, compatibility, and allowed-tools
  • Directory structure: SKILL.md plus optional scripts/, references/, and assets/ subdirectories
  • Progressive disclosure: metadata is always loaded (~100 tokens), the full SKILL.md loads on activation (under 5,000 tokens recommended), and resources load on demand
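The frontmatter constraints above can be checked mechanically. A minimal validator sketch (ours, not official agentskills.io tooling; allowing digits in names is our assumption beyond the "lowercase + hyphens" wording):

```python
import re

def validate_skill_frontmatter(fm: dict) -> list[str]:
    """Check the name and description constraints from the spec; return a list of problems."""
    errors = []
    name = fm.get("name", "")
    # 1-64 chars, lowercase words (optionally with digits) joined by single hyphens.
    if not (1 <= len(name) <= 64) or not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        errors.append("name must be 1-64 chars, lowercase + hyphens")
    desc = fm.get("description", "")
    if not (1 <= len(desc) <= 1024):
        errors.append("description must be 1-1024 chars")
    return errors

print(validate_skill_frontmatter({"name": "deploy-to-staging",
                                  "description": "Steps to deploy to staging."}))  # []
```

Because skills are plain files, a check like this can run in CI over `~/.hermes/memories/skills/` without any agent involvement.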

The creation trigger is the least documented part. It appears to be complexity-based — some heuristic of iteration count, tool calls, or solution novelty. The threshold isn't public, which makes it hard to predict when the agent will create a skill and when it won't.

Skills are stored locally at ~/.hermes/memories/skills/. They're plain files — inspectable, editable, portable. The agentskills.io standard means skills created in Hermes can theoretically work in 11+ other tools that adopt the spec.

Layer 3: Contextual persistence

A vector store indexes skill documents for workflow retrieval. When a new task resembles a past task, the system retrieves the relevant skill and uses it as a starting scaffold.

This is where layers 2 and 3 interact: layer 2 creates skills, layer 3 makes them findable. Without contextual persistence, the agent would have to know which skill to load by name. With it, the agent describes the task, and the closest matching skill surfaces.
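Mechanically, layer 3's retrieval is nearest-neighbor search over skill descriptions. A toy sketch with bag-of-words vectors standing in for the real embedding model (the skill corpus here is invented):

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

skills = {
    "debug-microservice": "trace logs, bisect commits, reproduce failing request",
    "optimize-pipeline": "profile stages, cache intermediates, parallelize steps",
}

def retrieve(task: str) -> str:
    """Return the skill whose description is closest to the task text."""
    q = Counter(task.lower().split())
    return max(skills, key=lambda s: cosine(q, Counter(skills[s].lower().split())))

print(retrieve("reproduce a failing request in the payments service"))  # debug-microservice
```

The agent never names the skill; it describes the task, and the closest description wins. That is also why this layer degrades when descriptions overlap, as discussed below.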

Layer 4: User modeling via Honcho

Honcho is an external service from Plastic Labs that models users through what they call "dialectical reasoning." It doesn't store conversations — it derives conclusions from them.

The data model is peer-centric with four primitives:

| Primitive | Purpose | Scope |
| --- | --- | --- |
| Workspace | Multi-tenant isolation | Top-level container |
| Peer | Entity representation | Both users and agents |
| Session | Interaction thread | Temporal boundary |
| Message | Atomic data unit | Conversations, events, documents |

The key insight is that both users and agents are "peers" — this enables multi-participant reasoning, not just one-way user profiling.

How it reasons: Custom reasoning models process messages asynchronously in the background, deriving Representations — Theory of Mind snapshots about each peer. These aren't raw transcripts. They're conclusions: "User has 10+ years Rust experience," "User prefers async communication," "User is working on a migration project."

How agents query it: Three retrieval methods:

  1. get_context() — returns a combination of messages, conclusions, and summaries up to a token budget. ~200ms latency.
  2. search() — hybrid text + semantic search across workspace, peer, or session scope.
  3. Dialectic chat — natural language queries to Honcho. The agent asks "What does this user care about?" and gets a reasoned answer, not a database row.
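The storage model is easier to see in code than in prose. A toy sketch of derivation-over-storage — our illustration of the idea, not Honcho's actual API, data model, or reasoning pipeline:

```python
class PeerModel:
    """Toy: keep derived conclusions about a peer, discard the transcripts."""

    def __init__(self):
        self.facts: list[str] = []  # Representations, not raw messages

    def observe(self, message: str) -> None:
        # Crude stand-in for Honcho's asynchronous reasoning models:
        # only messages yielding a stable conclusion produce a stored fact.
        if "prefer" in message.lower():
            self.facts.append(f"derived: {message.strip()}")
        # The raw message is intentionally not stored.

    def get_context(self, token_budget: int = 50) -> str:
        """Return derived facts, trimmed to a rough (word-count) token budget."""
        out, used = [], 0
        for fact in self.facts:
            cost = len(fact.split())
            if used + cost > token_budget:
                break
            out.append(fact)
            used += cost
        return "\n".join(out)

peer = PeerModel()
peer.observe("By the way, I prefer TypeScript for new services")
peer.observe("Here is the stack trace ...")  # no stable conclusion, nothing stored
print(peer.get_context())
```

The point of the sketch is the asymmetry: two messages in, one conclusion out, and no way to recover the original wording — which is exactly the verification risk discussed later.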

Configuration: enabled via user_profile_enabled: true and a HONCHO_API_KEY in ~/.hermes/config.yaml. This is the only layer that requires an external service.

Layer 5: Full-text search (FTS5)

SQLite FTS5 indexes all past interactions with LLM-powered summarization. Not raw logs — the system summarizes sessions before indexing, reducing noise and context pollution.

This layer answers temporal queries: "What did I do last Tuesday?" "What was the error I hit in the auth service last week?" Cross-session recall that the context window can't provide and that skill documents don't capture (skills are procedural, not episodic).
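Under the hood this is the standard FTS5 pattern. A minimal, self-contained version (the schema and sample data are ours, not Hermes's; summarization would happen before the INSERT):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# FTS5 virtual table over session summaries (requires an SQLite build with
# FTS5 enabled, which standard Python distributions include).
db.execute("CREATE VIRTUAL TABLE sessions USING fts5(day, summary)")
db.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [
        ("2024-05-07", "Fixed JWT expiry bug in the auth service"),
        ("2024-05-08", "Optimized the ETL pipeline, added caching"),
    ],
)

# Temporal recall: full-text MATCH query, best match first.
row = db.execute(
    "SELECT day, summary FROM sessions WHERE sessions MATCH ? ORDER BY rank",
    ("auth service",),
).fetchone()
print(row)  # ('2024-05-07', 'Fixed JWT expiry bug in the auth service')
```

Summarize-then-index is the interesting design choice: the query runs over conclusions about each session, not over every token the agent ever emitted.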

Every layer feeds back into the context window. The closed loop: tasks produce skills, skills improve future tasks, Honcho builds an evolving user model, FTS5 provides temporal recall. Each session is supposed to make the next one better.

The closed learning loop

The theory is compelling. An agent that gets better over time:

  1. Agent completes a task → writes SKILL.md
  2. Next similar task → vector store retrieves the skill → agent starts from a scaffold instead of zero
  3. Honcho observes the user → derives preferences → future sessions are personalized
  4. FTS5 indexes everything → temporal recall available across sessions

The question is whether this compounds in practice. We found no published benchmarks measuring skill reuse rates, user model accuracy over time, or degradation curves. The loop is well-designed in theory — the evidence gap is how it performs after months of heavy use.

Honcho: the user modeling question

Honcho's approach is fundamentally different from both Mem0's scope-based model and walrus's graph-based model.

Mem0 organizes memory by scope (conversation, session, user) and uses an LLM extraction pipeline to decide what goes where. The intelligence is in the extraction.

Walrus uses a single graph (LanceDB + lance-graph) with typed entities and explicit agent tools. The intelligence is in the agent — it decides what to remember.

Honcho derives conclusions from conversations without storing them. The intelligence is in the reasoning model that produces Representations. It doesn't store "user said they prefer TypeScript in message #47." It stores "user prefers TypeScript" as a derived fact.

This is closer to how humans remember — we forget the conversation, remember the conclusion. The risk is the same as with human memory: the conclusion might be wrong, and you can't go back to the source to verify.

Does it work? Honcho claims improved personalization and context-awareness. Honcho 3.0 added faster context retrieval and smarter embedding reuse. But we found no published A/B tests or benchmarks comparing agent performance with and without Honcho enabled. The contribution of user modeling to actual task completion is an open empirical question.

Memory System Coverage (0–10 scale)

The radar shows Hermes dominating on procedural memory and user modeling — the two capabilities that distinguish it from every other system. The gap on forgetting/decay is the most striking: Hermes scores 1 out of 10. It has no mechanism to forget.

Skill lifecycle: creation, retrieval, decay

Skills are Hermes's most original contribution. But the lifecycle has gaps.

Creation: autonomous, triggered by complexity heuristics. The threshold is undocumented — this makes it unpredictable. An agent might create a skill for a trivial task or miss a complex one.

Retrieval: vector similarity via the contextual persistence layer. The right skill surfaces for similar tasks. This works well when skill names and descriptions are distinctive. It works less well when skills overlap (three skills for "deploy to staging" created at different times with slightly different approaches).

What's missing:

  • Deduplication: No mechanism to detect that two skills solve the same problem. Mem0 uses cosine similarity (0.85 threshold) to merge near-duplicates. Hermes doesn't.
  • Versioning: No way to track skill evolution. If the agent rewrites a skill, the old version is gone.
  • Expiration: Skills never expire. A skill for "deploy to staging via Jenkins" persists long after you've migrated to GitHub Actions.
  • Conflict detection: Two skills with contradictory advice ("always use yarn" vs. "always use pnpm") can coexist with no system-level awareness of the conflict.
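Mem0-style deduplication is a small amount of code; the gap is that Hermes doesn't wire it in. A sketch using bag-of-words cosine as a stand-in for real embedding similarity (0.85 is the threshold cited above; the example descriptions are invented):

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_duplicate(new_desc: str, existing_desc: str, threshold: float = 0.85) -> bool:
    """Flag a new skill as a near-duplicate of an existing one before saving it."""
    return cosine(Counter(new_desc.lower().split()),
                  Counter(existing_desc.lower().split())) >= threshold

print(is_duplicate("deploy to staging", "deploy app to staging"))       # True
print(is_duplicate("deploy to staging", "profile the etl pipeline"))    # False
```

A check like this at skill-creation time would catch the three-overlapping-"deploy to staging"-skills case; deciding whether to merge or supersede is the harder, unaddressed half of the problem.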

Memory Lifecycle Features (0–10 scale)

Hermes leads on auto-creation and portability (agentskills.io). Mem0 leads on deduplication. Nobody scores well on versioning. The expiration row is telling — Hermes scores 0, meaning skills accumulate indefinitely.

What's missing: forgetting

None of the five layers has a documented forgetting mechanism.

  • Skills accumulate in ~/.hermes/memories/skills/ with no pruning
  • FTS5 index grows with every session, no summarization decay
  • Honcho representations persist indefinitely — derived facts are never invalidated
  • Vector store indexes grow with the skill collection

Contrast this with:

  • Mem0's DELETE operation — the extraction pipeline can explicitly remove contradicted memories
  • Walrus's compact and distill tools — imperfect, but the agent can at least consolidate and prune
  • Cognitive science — Ebbinghaus decay curves suggest unused memories should fade. No agent framework implements this
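The Ebbinghaus curve the last bullet mentions is a one-line formula, R = e^(−t/S): retention R decays exponentially with time t since last use, scaled by a stability parameter S. A sketch of how a framework could score skills for pruning (the scoring scheme and threshold are our invention):

```python
import math

def retention(days_since_use: float, stability: float = 30.0) -> float:
    """Ebbinghaus-style forgetting curve: retention decays exponentially with disuse."""
    return math.exp(-days_since_use / stability)

def prune(skills: dict[str, float], threshold: float = 0.05) -> list[str]:
    """Return skills (name -> days since last use) whose retention fell below threshold."""
    return [name for name, days in skills.items() if retention(days) < threshold]

# A skill untouched for 400 days scores near zero; one used 3 days ago stays.
print(prune({"deploy-via-jenkins": 400, "debug-auth": 3}))  # ['deploy-via-jenkins']
```

Nothing here is hard to build; the open question is what Hermes should do with a pruned skill — delete, archive, or demote it in retrieval ranking.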

The absence of forgetting is a design choice, not an oversight. Hermes bets that more memory is always better than less. This works at small scale. At 10,000+ skills and years of FTS5 logs, the signal-to-noise ratio is an open question. We explored the broader patterns of memory growth in our survey of persistent agent memory, and the context compaction survey covers how frameworks handle the related overflow problem.

Infrastructure cost

Hermes's memory is cheaper to run than it looks. Three of five layers are local:

| Layer | Runs locally | External service | LLM calls |
| --- | --- | --- | --- |
| Context window | Yes | None | 0 per op |
| Skills (SKILL.md) | Yes | None | 1 (creation only) |
| Contextual (vector) | Yes | None | 0 per op |
| Honcho (user model) | No | Honcho API | 1+ per session |
| FTS5 (search) | Yes | None | 0 per query |

Configuration lives in ~/.hermes/config.yaml:

memory:
  memory_enabled: true
  user_profile_enabled: true    # requires HONCHO_API_KEY
  memory_char_limit: 2200       # ~800 tokens for MEMORY.md
  user_char_limit: 1375         # ~500 tokens for USER.md

Each layer can be disabled independently. You can run Hermes with just skills and FTS5 (fully local, no external services) or add Honcho for user modeling. The skip_memory parameter in the AIAgent() constructor disables persistence entirely.

Per-Layer Infrastructure Cost

The cost profile is bimodal: four layers are effectively free (local files, local SQLite), one layer (Honcho) requires an external service with LLM calls. FTS5 has the highest storage growth because it indexes every session.

How it compares

| | Hermes Agent | Mem0 | Walrus |
| --- | --- | --- | --- |
| Memory layers | 5 | 3 scopes | 1 |
| Procedural memory | SKILL.md (autonomous) | None | None |
| User modeling | Honcho (dialectical reasoning) | User scope (LLM extraction) | Graph entities (agent-driven) |
| Cross-session recall | FTS5 + LLM summarization | Vector similarity retrieval | Graph traversal + journals |
| Deduplication | None | LLM-powered (cosine 0.85) | Upsert by key |
| Forgetting | None | DELETE operation | compact / distill tools |
| External services | 1 (Honcho, optional) | 4 self-hosted / 1 managed | 0 |
| Skill portability | agentskills.io (11+ tools) | None | None |
| LLM calls per write | 1 (skill creation) | 2 (extract + decide) | 0 |

What walrus does differently

Walrus's single layer — LanceDB + lance-graph with six tools (remember, recall, relate, connections, compact, distill) — is a deliberate bet against complexity.

No skill documents, no user modeling service, no FTS5 index. The agent decides what's worth remembering and writes it to the graph. The agent queries the graph when it needs context. The agent compacts when memory grows.

Where Hermes wins: procedural memory is genuinely valuable. An agent that writes down how it solved a problem and reuses that solution later is a meaningful capability. Walrus doesn't have this — the agent can remember facts and relate entities, but it can't capture a multi-step procedure as a reusable unit.

Where walrus wins: zero external services, zero LLM calls per write, one mental model. When something goes wrong with memory in walrus, you inspect one graph. When something goes wrong in Hermes, you debug across five layers — was the skill created? Was it indexed? Did Honcho derive the right conclusion? Did FTS5 find the right session?

The deeper question: is five layers the right number? Could Hermes achieve 90% of the benefit with two layers (skills + FTS5) and skip the vector store, Honcho, and the complexity they add? The answer depends on whether user modeling and contextual persistence produce measurable improvements — and right now, that evidence doesn't exist publicly.

Open questions

  1. Does the closed learning loop actually compound? No published benchmarks measure skill reuse rates or user model accuracy over time. The architecture is sound — does it work after 500 sessions?

  2. What triggers skill creation? The complexity threshold is undocumented. Without knowing when the agent will or won't create a skill, developers can't rely on skill accumulation as a feature.

  3. Can five layers stay consistent? A skill says "use yarn." The user tells the agent "I switched to pnpm." The FTS5 index has sessions using both. Does the agent reconcile these, or does it depend on which layer it queries first?

  4. Does Honcho user modeling measurably improve task performance? Dialectical reasoning is a novel approach. But the value proposition — "the agent understands you better over time" — needs evidence. A/B test results, task completion rate comparisons, anything quantitative.

  5. What happens after six months of heavy use with no forgetting? Skills accumulate, FTS5 grows, Honcho representations multiply. At what point does the noise outweigh the signal? Does retrieval quality degrade as the corpus grows?

Further reading