Hermes Agent: Self-Improving AI Runs Locally on RTX

Hermes Agent self-improving AI agent running locally on NVIDIA RTX GPU with neural network visualization

Hermes Agent by NousResearch — the self-improving AI agent endorsed by NVIDIA RTX AI Garage

Most AI agents reset completely between sessions. Hermes Agent does the opposite. Built by Nous Research and now endorsed by NVIDIA, Hermes writes skill files after every complex task, refines them through use, and builds a cross-session model of how you work. On May 10, it hit number one on OpenRouter’s global agent rankings — processing 224 billion tokens per day, ahead of every other agent on the platform. This week, NVIDIA made it official by featuring Hermes in the RTX AI Garage, optimized to run locally on RTX PCs and DGX Spark with Qwen 3.6.

An Agent That Compounds

The distinction that matters here is not speed or benchmark scores. It is whether an agent gets better the more you use it. Hermes does, through three interlocking mechanisms.

First, when Hermes completes a multi-step task — anything that requires five or more tool calls — it automatically writes a skill file to disk. No configuration needed; it happens on its own. The next time you hand it a similar task, it searches that file library using full-text search and retrieves the relevant procedure, executing faster and with fewer tokens.

Second, those skills are not static. The agent updates them as it finds better approaches. You are not managing a growing list of prompts manually; the agent curates its own toolkit.

Third, Hermes builds a cross-session user model via Honcho, a memory backend that reasons about your working patterns after each conversation. Stable preferences, project history, and context carry forward rather than disappearing when the session ends.

The result is an agent that compounds through use in a way that cloud-hosted assistants, which reset on every API call, structurally cannot match.

The Hardware Story: RTX and DGX Spark with Qwen 3.6

NVIDIA’s endorsement is not incidental. It is what makes the local angle credible at scale. The RTX AI Garage feature, published May 13, 2026, pairs Hermes with Qwen 3.6 — a family of models that changes the math on local inference.

Qwen 3.6’s 27B parameter model matches the accuracy of 400B-parameter models from the previous generation at one-sixteenth the memory footprint. The 35B model runs at roughly 20GB VRAM. On RTX PRO hardware, that translates to token generation speeds three times faster than a baseline CPU setup.

For developers who already own RTX hardware, this is not a future aspiration. The inference performance is here now, on hardware sitting on desks today.

At the higher end, NVIDIA DGX Spark — 128GB unified memory, one petaflop of AI compute — can sustain all-day agentic workflows running 120-billion-parameter mixture-of-experts models. Hermes runs as a persistent background process on DGX Spark, handling tasks continuously rather than as one-off invocations.

Getting Started

Installation is a single command on Linux, macOS, WSL2, and Android via Termux:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

A desktop GUI installer is also available for Windows and macOS. Once installed, run the TUI interface:

hermes --tui

Hermes requires a model with at least a 64,000-token context window — smaller windows cannot maintain working memory for multi-step tool calls. It connects to Ollama, LM Studio, OpenRouter, AWS Bedrock, and NVIDIA NIM out of the box.

To verify the self-improvement loop is active, run a task that involves several tool calls: “Research the top five AI agent frameworks” works well. Hermes will surface the skill file it created, and you can inspect it directly in the working directory.

Total time from zero to a functional, memory-backed, skill-learning agent is roughly twenty to thirty minutes. Full setup instructions are in the official quickstart documentation.

The Self-Evolution Layer

Beyond the built-in skill system, a separate project — hermes-agent-self-evolution — applies DSPy and GEPA (Genetic-Pareto Prompt Evolution) to optimize skill files, tool descriptions, and system prompts offline. It reads execution traces to understand why tasks failed, not just that they did, and proposes improvements via pull requests against your local Hermes configuration.

No GPU training is required. Each optimization run costs roughly two to ten dollars in API calls and works with as few as three examples. The approach was presented as an ICLR 2026 Oral, which gives it more than casual credibility.

This is not core Hermes — it is an optional layer for teams that want to take the self-improvement loop further and measure it systematically.

Why It Matters

Hermes’s trajectory — 140,000 GitHub stars in under three months, 1,000-plus contributors, and 224 billion daily tokens on OpenRouter — is not hype waiting for correction. Token throughput on a third-party routing layer is a harder metric to inflate than stars. Real developers are routing real workloads through it.

The broader implication is that self-improving local agents are no longer a research concept. Hermes is shipping it in production, running on hardware that a significant portion of the developer community already owns, at a cost well below a monthly cloud AI subscription.

If you have been waiting for local agents to reach parity with cloud offerings, this is the inflection point worth paying attention to.

ByteBot

I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover latest tech news, controversies, and summarizing them into byte-sized and easily digestible information.