

Hermes Agent: what Nous Research built

We examined Hermes Agent's architecture — from Atropos RL training to persistent skill documents. Here's how it works and where it fits.

research · OpenWalrus Team

Update (v0.0.7): The comparison table in this post lists walrus as having "built-in" local inference. As of v0.0.7, local inference was removed — OpenWalrus now connects to remote providers only. Memory and search are now external WHS services.

In February 2026, Nous Research released Hermes Agent — an open-source (MIT), Python-based agent runtime with persistent memory, autonomous skill creation, and local inference support via Ollama, vLLM, or llama.cpp. It positions itself "between a Claude Code style CLI and an OpenClaw style messaging platform agent." Six thousand GitHub stars in the first month.

We dug into how it actually works: the training pipeline that produces its models, the multi-level memory system that lets it learn across sessions, and the agentskills.io standard that makes its skills portable to 11+ other tools. Here's what we found.

The model stack

Hermes Agent runs on Hermes 3 and Hermes 4, a family of fine-tuned open-weight models from Nous Research. The models and the agent runtime are separate projects — Hermes Agent can use any OpenAI-compatible endpoint, but the Hermes models are purpose-built for agentic workloads.

Hermes 3 (August 2024)

Fine-tuned on Llama 3.1 at three scales: 8B, 70B, and 405B parameters. The technical report details the training:

  • Data: ~390M tokens of synthetically generated responses (not human feedback). 69% output tokens, 31% instruction tokens. Constructed March–August 2024.
  • Training: Two-phase — supervised fine-tuning (SFT) followed by direct preference optimization (DPO).
  • Packing: 96% sample packing efficiency at 8192-token sequences via Flash Attention 2 with attention masking to prevent cross-sample contamination.
  • Format: ChatML (<|im_start|> / <|im_end|> delimiters) for OpenAI API compatibility.
  • Function calling: Trained on the hermes-function-calling-v1 dataset — single and multi-turn function calling, structured JSON outputs, agentic scenarios. Tools specified in <tools> XML tags, invoked via JSON with arguments and name fields.
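The function-calling format described above can be sketched concretely. This is an illustrative reconstruction, not the exact prompt Hermes was trained on; the `get_weather` tool and its schema are hypothetical:

```python
import json

# Hypothetical tool schema in the JSON-schema style used by
# OpenAI-compatible APIs; the exact schema Hermes trains on may differ.
weather_tool = {
    "name": "get_weather",
    "description": "Return current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# ChatML system turn: tools advertised inside <tools> XML tags.
system = (
    "<|im_start|>system\n"
    "You may call tools. Available tools:\n"
    f"<tools>{json.dumps([weather_tool])}</tools>\n"
    "<|im_end|>\n"
)
user = "<|im_start|>user\nWhat's the weather in Oslo?<|im_end|>\n"
prompt = system + user + "<|im_start|>assistant\n"

# A compliant model replies with a JSON object carrying
# "name" and "arguments" fields, which the runtime parses.
reply = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
call = json.loads(reply)
print(call["name"], call["arguments"]["city"])
```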

The predecessor model (Hermes 2 Pro) achieved 90% function calling accuracy compared to 60–70% for general-purpose models of similar size. Hermes 3 improved on this across multiple benchmarks while adding enhanced code generation and multi-turn conversation handling.

Hermes 4 (August 2025)

A significant jump. The technical report documents two major innovations:

Hybrid reasoning: Models toggle between standard responses and explicit deliberation using <think>...</think> tags. Thinking traces can extend up to 16,000 tokens. Users choose whether they want fast answers or detailed reasoning — the model adapts rather than always defaulting to verbose chain-of-thought.
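A consumer of a hybrid-reasoning model has to separate the thinking trace from the final answer. A minimal sketch, assuming the tags appear literally in the output text:

```python
import re

def split_reasoning(text: str):
    """Separate <think>...</think> traces from the visible answer.
    A consumer-side sketch; the runtime's actual tag handling may differ."""
    traces = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return traces, answer

out = "<think>16 is 2^4, so log2(16) = 4.</think>The answer is 4."
traces, answer = split_reasoning(out)
print(len(traces), answer)
```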

DataForge: A graph-based synthetic data generation system that replaced the manual curation pipeline. Each node in a DAG performs a struct-to-struct transformation — converting simple seed data into complex training formats (e.g., Wikipedia article → rap song → Q&A pair). An LLM judge evaluates outputs on coherence, relevance, complexity, style, and tone, iterating until the sample passes or hits a maximum retry count.
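The judge-gated retry loop at each DAG node can be sketched as follows. `transform`, `judge`, and the retry cap are all stand-ins, since the post doesn't specify DataForge's internals:

```python
MAX_RETRIES = 3  # assumed cap; the real pipeline's limit isn't documented here

def transform(seed, attempt):
    # Stand-in for one DAG node's struct-to-struct transformation
    # (e.g. Wikipedia article -> rap song -> Q&A pair).
    return {"source": seed, "sample": f"draft {attempt} of {seed}"}

def judge(candidate, attempt):
    # Stand-in for the LLM judge scoring coherence, relevance,
    # complexity, style, and tone. Here: reject draft 1, accept draft 2.
    return attempt >= 2

def generate(seed):
    """Iterate a node until the judge passes the sample or retries run out."""
    for attempt in range(1, MAX_RETRIES + 1):
        candidate = transform(seed, attempt)
        if judge(candidate, attempt):
            return candidate
    return None  # discarded: never passed the judge

print(generate("wiki article")["sample"])
```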

The numbers tell the scaling story: Hermes 3 used 1M samples and 1.2B tokens. Hermes 4 uses ~5M samples and ~60B tokens — 5x more samples, 50x more tokens. Of those 5M samples, 3.5M are reasoning-heavy (intentionally longer) and 1.6M are non-reasoning.

Hermes 4.3 (36B) is particularly interesting: it's fine-tuned on ByteDance Seed 36B, not Llama. This breaks the assumption that all Hermes models share a Llama backbone. It achieves a 78.4% reduction in overlong reasoning generation on AIME'24 with only a 4.7% accuracy cost — solving the "model thinks for too long" problem that plagues many reasoning models.

Atropos

The training uses Atropos, Nous Research's distributed reinforcement learning framework. It's not standard RLHF — it's a rollout handler that manages asynchronous coordination across potentially thousands of distributed workers, addressing the challenge of highly variable LLM generation times. In Hermes 4 training, Atropos drives rejection sampling across ~1,000 task-specific verifiers to filter for high-quality reasoning trajectories.
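At its core, rejection sampling with verifiers reduces to a filter: generate many rollouts, keep only the ones a task-specific verifier accepts. A toy sketch under that reading (the verifier and trajectory shape are hypothetical):

```python
def verify_math(trajectory):
    # Stand-in for one of the ~1,000 task-specific verifiers: here, a toy
    # check that the trajectory's claimed answer matches ground truth.
    return trajectory["answer"] == trajectory["expected"]

def rejection_sample(trajectories, verifier):
    """Keep only trajectories the verifier accepts. Rejects are discarded
    outright rather than down-weighted, which is what distinguishes
    rejection sampling from preference-based RLHF."""
    return [t for t in trajectories if verifier(t)]

rollouts = [
    {"prompt": "2+2", "answer": 4, "expected": 4},
    {"prompt": "2+2", "answer": 5, "expected": 4},
]
kept = rejection_sample(rollouts, verify_math)
print(len(kept))
```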

[Chart: Hermes training scale, log-adjusted for readability]

Agent architecture

The ReAct loop

Hermes Agent implements the classic ReAct pattern: Observation (read terminal output, file contents) → Reasoning (analyze state against goals) → Action (execute commands, call tools) → Loop. The innovation isn't the loop itself — it's what surrounds it.
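The loop itself is small. A minimal skeleton, with toy observe/reason/act callables standing in for the terminal I/O and tool calls the real runtime wires in:

```python
def react_loop(goal, observe, reason, act, max_steps=10):
    """Minimal ReAct skeleton: observe -> reason -> act until done.
    A sketch, not Hermes Agent's actual control flow."""
    state = observe()
    for _ in range(max_steps):
        decision = reason(goal, state)
        if decision == "done":
            return state
        state = act(decision)
    return state

# Toy wiring: count up to a target value.
target = 3
counter = {"n": 0}
result = react_loop(
    goal=target,
    observe=lambda: counter["n"],
    reason=lambda goal, s: "done" if s >= goal else "increment",
    act=lambda d: counter.__setitem__("n", counter["n"] + 1) or counter["n"],
)
print(result)
```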

Multi-level memory

Five layers of persistence, from ephemeral to permanent:

  1. Short-term inference memory: Standard transformer context within a single session. Nothing survives restart.
  2. Procedural skill documents: Persistent markdown files (SKILL.md) capturing step-by-step solutions to completed tasks. Created autonomously when the agent finishes something complex — debugging a microservice, optimizing a pipeline. Unlike standard RAG (which retrieves disjointed snippets), skills maintain cohesive procedural understanding.
  3. Contextual persistence: Searchable vector store indexing skill documents for workflow retrieval. When a new task resembles a past task, the relevant skill is retrieved and used as a starting scaffold.
  4. User modeling via Honcho: An entity-centric memory library from Plastic Labs. Represents both users and agents as "peers." Asynchronously reasons about peer psychology from messages, deriving facts and storing them in reserved collections. No messages = no reasoning = no memory. The model evolves over time: preferences, work patterns, domain expertise.
  5. Full-text search (FTS5): SQLite-based searchable database of all past interactions with LLM-powered summarization. Cross-session recall for "what did I do last Tuesday?" queries.
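Layer 5 maps directly onto SQLite's FTS5 extension. A small sketch of cross-session recall, assuming an SQLite build with FTS5 enabled (bundled with CPython on most platforms); the table name and schema are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: every column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE interactions USING fts5(session, content)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?)",
    [
        ("2026-02-10", "debugged the payments microservice timeout"),
        ("2026-02-11", "optimized the ETL pipeline batch size"),
    ],
)
# Cross-session recall: full-text MATCH query over all past interactions.
rows = conn.execute(
    "SELECT session FROM interactions WHERE interactions MATCH 'pipeline'"
).fetchall()
print(rows)
```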

The closed learning loop ties these together: the agent completes tasks → creates skill documents → skills improve during subsequent use → periodic nudges prompt the agent to persist valuable knowledge → FTS5 enables cross-session recall → Honcho builds an evolving model of the user. Each session makes the next one better.

This is a different philosophy from walrus's graph-based memory (LanceDB + lance-graph with Agent/User/Episode nodes). Hermes leans on procedural knowledge (skill docs) and user modeling (Honcho). Walrus leans on relational knowledge (graph traversal) and episode replay. Both aim for the same outcome — agents that remember — but the representations differ. We explored these tradeoffs in persistent agent memory research.

Six terminal backends

Hermes Agent separates the agent runtime from the execution environment. Six backends implement a common BaseEnvironment interface:

| Backend | Use case | Key feature |
| --- | --- | --- |
| Local | Development, personal use | Direct system execution, no isolation |
| Docker | Production, security-sensitive | Read-only root filesystem, dropped capabilities, PID limits, namespace isolation |
| SSH | Remote servers | Persistent environment across sessions |
| Daytona | Cloud development | Serverless dev environments |
| Singularity | HPC, research clusters | Container orchestration for compute-heavy workloads |
| Modal | Serverless production | Hibernates when idle, wakes on demand, near-zero cost between sessions |

Configuration is a single line in ~/.hermes/config.yaml: backend: modal. The agent code doesn't change — only the execution surface.

MCP (Model Context Protocol) support is built in. The client connects at startup, discovers tools from configured servers, and registers them as first-class tools. Automatic reconnection uses exponential backoff (1s → 2s → 4s → 8s → 16s, max 5 attempts). Both stdio-based (local subprocesses) and HTTP-based (remote StreamableHTTP) servers are supported.
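The reconnect schedule is plain exponential backoff. A sketch that reproduces the stated delays:

```python
def backoff_delays(base=1.0, factor=2.0, max_attempts=5):
    """Exponential backoff schedule matching the described
    1s -> 2s -> 4s -> 8s -> 16s, capped at five attempts."""
    return [base * factor**i for i in range(max_attempts)]

print(backoff_delays())
```

In practice each delay would be slept between reconnect attempts, often with jitter added to avoid thundering-herd reconnects; the post doesn't say whether Hermes Agent jitters.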

The agentskills.io standard

The most consequential part of Hermes Agent might not be the agent itself — it's the agentskills.io standard it follows for portable skills.

A skill is a directory containing a SKILL.md file with YAML frontmatter and markdown instructions:

```markdown
---
name: deploy-to-production
description: Safely deploy the current branch to production with rollback support
license: Apache-2.0
---

## Steps

1. Run the test suite and verify all tests pass
2. Create a tagged release from the current branch
3. Deploy using the project's deploy script
4. Verify the deployment health check endpoint
5. If health check fails, trigger automatic rollback
```

The standard specifies minimal required fields (name, description), optional metadata, and an unrestricted markdown body (recommended under 5,000 tokens). Optional directories (scripts/, references/, assets/) support more complex skills.
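Because the format is just YAML frontmatter plus a markdown body, loading a skill takes only a few lines. A minimal parser sketch (handles flat `key: value` pairs only; a real loader would use a YAML library):

```python
def parse_skill(text: str):
    """Split a SKILL.md file into frontmatter metadata and markdown body."""
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()

skill = """---
name: deploy-to-production
description: Safely deploy the current branch to production
---

## Steps
1. Run the test suite
"""
meta, body = parse_skill(skill)
print(meta["name"], body.startswith("## Steps"))
```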

What makes this significant: 11+ tools have adopted agentskills.io — Claude Code, Cursor, GitHub Copilot, Gemini CLI, VS Code, Amp, Goose, Roo Code, Kiro, Codex, and OpenCode. A skill written for Hermes Agent works in Claude Code. A skill written for Cursor works in Hermes Agent. This is rare in the agent ecosystem — most skill/plugin systems are framework-specific.

Walrus's approach is different: markdown skill files with YAML frontmatter and tag-based lookup across three tiers (builtin, user, project). The format is similar in spirit (markdown + metadata), but walrus skills are designed for the walrus runtime specifically, not for cross-framework portability. Whether agentskills.io becomes the universal standard or fragments into vendor-specific extensions is an open question — we discussed this in the context of our skills design philosophy.

How it compares

[Chart: agent runtime capabilities, 0–10 scale]

|  | Hermes Agent | Claude Code | OpenClaw | Walrus |
| --- | --- | --- | --- | --- |
| Language | Python | TypeScript | TypeScript | Rust |
| Local inference | Ollama, vLLM, llama.cpp | No | No | Built-in |
| Memory | 5-level (FTS5, vector, Honcho, skills) | Session-based | Session-based | Graph + vector (LanceDB) |
| Skills | agentskills.io (11+ tools) | agentskills.io | Custom | Markdown + tags |
| Setup | pip + model server | Subscription + IDE | npm + API keys | Single binary |
| Backends | 6 (local, Docker, SSH, serverless) | IDE-embedded | Cloud gateway | Local process |
| Messaging | Telegram, Discord, Slack, WhatsApp, Signal | N/A | 20+ platforms | Telegram, Discord |
| Stars | 6.1K | N/A | 247K | Early stage |
| License | MIT | Proprietary | MIT | MIT |

The architectural divide is clear. Hermes Agent gives you the most flexibility: six execution backends, five memory layers, broad messaging support, portable skills. The cost is setup complexity — Python runtime, a separate model server (Ollama/vLLM), configuration files, and dependency management.

Walrus takes the opposite bet: one binary, built-in inference, zero external dependencies. Less flexibility, but the curl | sh to running-agent path is measured in seconds, not minutes. As we explored in how agents call agents, the framework's architectural choices cascade into everything from loop prevention to deployment patterns.

What the research says

The Hermes 3 technical report demonstrates that the 405B variant achieves state-of-the-art performance among open-weight models on several benchmarks. The function calling fine-tuning is particularly effective — the earlier Hermes 2 Pro achieved 90% accuracy compared to 60–70% for general-purpose models, a gap that Hermes 3 widened further.

The Hermes 4 report introduces the hybrid reasoning approach and validates it empirically: 78.4% reduction in overlong generation on AIME'24 with minimal accuracy cost. The DataForge pipeline's 60B-token synthetic dataset represents a bet that quantity and diversity of synthetic data, filtered by task-specific verifiers, outperforms smaller curated datasets.

A Render blog benchmark provides a striking finding: the same underlying model (Opus 4.5) achieves a 17-problem difference on SWE-bench depending on the agent scaffolding. Architecture matters more than model selection. This validates both Hermes Agent's investment in its ReAct loop + memory system and walrus's focus on runtime architecture — the model is necessary but not sufficient.

Honcho's user modeling approach (from Plastic Labs) represents an underexplored direction. Most agent memory systems focus on what the agent did (episodes, tool calls, outputs). Honcho focuses on who the user is — preferences, work patterns, domain expertise. Whether this produces meaningfully better agent behavior over time, or just accumulates an increasingly stale user profile, is an open empirical question.

Open questions

Does agentskills.io become the POSIX of agent skills? Eleven tools adopting the same standard is remarkable, but standardization has a history of fragmenting under pressure. When vendors need features the standard doesn't support (authentication, versioning, dependency management), do they extend agentskills.io or fork it? The SKILL.md format is deliberately minimal — which makes adoption easy but may make evolution hard.

Is Python + Ollama the right stack for local-first? Hermes Agent requires a Python runtime, a separate model server process, and configuration. This works for developers already in the Python ecosystem, but it's friction for anyone who isn't. A single binary that includes the inference engine (walrus's approach) removes an entire category of "it works on my machine" problems. The question is whether the flexibility of separate components outweighs the simplicity of a monolith.

Can autonomous skill creation actually compound? Hermes Agent's learning loop — complete tasks, create skills, improve skills during use — is the most ambitious memory system we've surveyed. But skills accumulate. Do old skills become stale? Do conflicting skills create confusion? Is there a pruning mechanism, or does the vector store grow unbounded? The agentskills.io standard says nothing about skill lifecycle management.

Does Honcho's user modeling outperform graph memory? Hermes models users as entities with derived facts. Walrus models relationships as graph edges with episode nodes. Both are persistent, both evolve. But they make different retrieval tradeoffs: Honcho retrieves user context ("this user prefers TypeScript"), walrus retrieves relational context ("last time this user asked about deployment, the agent used this approach"). Which produces better agent behavior at the 100-session mark?

DataForge's synthetic data pipeline: quantity vs quality? Hermes 3 used 390M tokens of curated data. Hermes 4 uses 60B tokens of DataForge-generated data — a 150x increase. The LLM judge provides quality filtering, but synthetic data can amplify biases present in the seed data. Does 60B tokens of synthetic data actually produce a better agent than 390M tokens of carefully curated data? The Hermes 4 benchmarks suggest yes, but the comparison isn't clean — the base model also changed (Llama 3.1 → ByteDance Seed).

Further reading