First look at predictive intents for Hermes: Architecture should be benchmarked, not debated

Architecture is a series of empirical questions. What currently exists? Do we build or buy? What’s the fastest? What is the simplest to deploy? And so on.

Now, answering these questions is easier than ever.

If you don’t test, it is just a guess.

I have stopped debating which tech stack or tool to use and started benchmarking individual components of the stack. For example, I am adding predictive intents to my agent. I want it to passively parse my intents from user memory, and I want the agent to just do stuff: figure out what I want done and do it without me having to ask. Say it realizes that I want to compile my blog posts into an e-book. I get a link to buy it on Amazon. Yes, this is scary.

When I think about this task, there are a few questions that arise in my mind:

One is data ingestion: how do we build and use the model of who you are?

Next is intent parsing: how does it label the data to distinguish which database entities are facts about you and which are intents?

Example Fact: I eat every day. Example Intent: I need to eat lunch before 1:00 PM. Example Deliverable: I get a notification of a food delivery.

Last is pruning stale intents: how do we detect and drop intents that can no longer be acted on?

Example Stale Intent: I wanted to visit a restaurant that is now out of business.
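To make the fact / intent / stale-intent distinction concrete, here is a minimal sketch of one way to label and prune memory items. Everything in it is an assumption for illustration: the `MemoryItem` fields, the `Kind` enum, the `is_stale` rule, and the fixed timestamp are hypothetical, not how Hermes stores memory.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Kind(Enum):
    FACT = "fact"      # true about the user, nothing to act on
    INTENT = "intent"  # something the user wants done


@dataclass
class MemoryItem:
    # Hypothetical shape for one labeled memory entry.
    text: str
    kind: Kind
    deadline: Optional[datetime] = None  # after this, acting on the intent is pointless
    target_active: bool = True           # False once the thing it points at is gone


def is_stale(item: MemoryItem, now: datetime) -> bool:
    """An intent is stale if its target is gone or its deadline has passed."""
    if item.kind is not Kind.INTENT:
        return False
    if not item.target_active:
        return True
    return item.deadline is not None and item.deadline < now


def prune(items: list[MemoryItem], now: datetime) -> list[MemoryItem]:
    """Keep facts and live intents; drop stale intents."""
    return [item for item in items if not is_stale(item, now)]


now = datetime(2025, 1, 1, 12, 0)  # arbitrary "current time" for the example
items = [
    MemoryItem("I eat every day.", Kind.FACT),
    MemoryItem("I need to eat lunch before 1:00 PM.", Kind.INTENT,
               deadline=datetime(2025, 1, 1, 13, 0)),
    MemoryItem("I want to visit a restaurant.", Kind.INTENT,
               target_active=False),  # the restaurant is out of business
]
live = prune(items, now)
print([item.text for item in live])
```

In practice the hard part is deciding who sets `kind` and `target_active`; this sketch only shows what the pruning step could look like once those labels exist.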

So how do we architect this in the agent age? An easy option is to use /goal or similar to solve the problem.

However, how do we verify that the agent has made the correct architectural choice?

My architecture strategy is to start with the data model.

My instinct for this particular problem is that a graph database will work, but I don’t think it will scale well. I want to ship this as a feature for my Hermes agent (shout out @NousResearch), and I want it to be portable. So heavy graph DBs like Neo4j are already out because I don’t want the overhead.

That leaves us with some options to consider:

SQLite is already a dependency of Hermes, so it’s a candidate. It goes against my graph DB instinct, but recursive CTEs might be good enough without adding extra complexity.
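For readers who have not used recursive CTEs for graph traversal, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema (`entities`, `edges`) and the sample rows are assumptions made up for illustration; only the relationship names come from the snapshot described in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY, type TEXT NOT NULL, name TEXT NOT NULL);
CREATE TABLE edges (
    src INTEGER NOT NULL REFERENCES entities(id),
    dst INTEGER NOT NULL REFERENCES entities(id),
    rel TEXT NOT NULL
);
CREATE INDEX idx_edges_src ON edges(src);
""")
conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", [
    (1, "person", "me"),
    (2, "commitment", "ship e-book"),
    (3, "risk", "missed deadline"),
    (4, "skill", "time blocking"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    (1, 2, "RELATED_TO"),
    (2, 3, "HAS_RISK"),
    (3, 4, "MITIGATED_BY"),
])

# Walk outgoing edges from a start node, up to a maximum depth.
# UNION (not UNION ALL) drops duplicate (id, depth) rows, and the
# depth cap keeps a cyclic graph from recursing forever.
REACHABLE = """
WITH RECURSIVE walk(id, depth) AS (
    SELECT ?, 0
    UNION
    SELECT e.dst, w.depth + 1
    FROM walk AS w JOIN edges AS e ON e.src = w.id
    WHERE w.depth < ?
)
SELECT en.name, MIN(w.depth) AS depth
FROM walk AS w JOIN entities AS en ON en.id = w.id
WHERE w.depth > 0
GROUP BY en.id
ORDER BY depth, en.name
"""
rows = conn.execute(REACHABLE, (1, 3)).fetchall()
print(rows)
```

This is the verbose part: a multi-hop walk that Cypher expresses in one pattern takes a dozen lines of SQL, but it needs nothing beyond the standard library.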

There is Kuzu, which is a graph database that uses Cypher like Neo4j. I am used to Cypher queries, but I am not really interested in adding dependencies for the sake of it. I am curious whether it works better for managing what is essentially a knowledge graph.

Finally, we could just use the baseline. Hermes manages memory files via FTS5, so we could ask the LLM to synthesize an answer from them. From previous work managing graphs with LLMs, my gut tells me this is a bad idea for my use case.

In years past, when software development was expensive, I’d have a discussion with my team, compare experience, come to a conclusion, and just ship it. Someone on the team might be a strong proponent of one tech over the other. I’ve worked with people who think it is graph DB or nothing, and I have worked with grug-brained staff engineers who would rather ship with what is already there and ask questions when something breaks. I personally don’t care; I want fewer words and more proof. (shout out @GregKamradt)

Now that era is dead. It is easy enough to take a dataset and run an experiment. I generated a dataset, built three Docker containers, made the runner, and ran the test. The dataset was based on my agent’s memories. We are testing the ETL of the data now, not crawlers or parsers, so a plain artifact is fine.

The Experiment

I extracted a complete knowledge snapshot from my Hermes agent memory. It’s small since I’m new to Hermes: 72 entities across 8 types (people, companies, topics, commitments, risks, opportunities, preferences, and skills) and 59 edges across 17 relationship types.

Hermes extracted the relationships KNOWS, HAS_RISK, MITIGATED_BY, RELATED_TO, PREFERS, HAS_SKILL, and more.

I defined 10 benchmark queries that cover the traversal patterns the predictive intent tool needs.

I built three isolated Docker runners. Same Python base image. Same dataset. Same 10 queries. 10 iterations each. We are only testing the query engine.
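The timing loop inside each runner can be very small. Below is a minimal sketch of the idea, not the actual runner: the `bench` helper, the two sample queries, and the synthetic 59-edge chain are all assumptions for illustration. It reports the mean per query in microseconds.

```python
import sqlite3
import statistics
import time


def bench(run_query, queries, iterations=10):
    """Run each named query `iterations` times; return mean microseconds per query."""
    results = {}
    for name, query in queries.items():
        samples = []
        for _ in range(iterations):
            start = time.perf_counter_ns()
            run_query(query)
            samples.append((time.perf_counter_ns() - start) / 1_000)
        results[name] = statistics.mean(samples)
    return results


# Synthetic stand-in data: a simple chain of 59 edges.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER, rel TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [(i, i + 1, "RELATED_TO") for i in range(59)],
)

# Hypothetical queries; the real suite has 10 traversal patterns.
queries = {
    "count_edges": "SELECT COUNT(*) FROM edges",
    "one_hop": "SELECT dst FROM edges WHERE src = 1",
}

# Each engine only has to supply its own run_query callable.
timings = bench(lambda q: conn.execute(q).fetchall(), queries)
for name, microseconds in timings.items():
    print(f"{name}: {microseconds:.1f} us")
```

Passing the engine in as a callable is what keeps the comparison fair: the loop, the iteration count, and the clock are identical for every backend.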

The Results

SQLite:

  • 5 to 20 microseconds per query.
  • Zero new dependencies.
  • 56 kilobytes on disk.
  • Every query returned exact answers.
  • The entire benchmark suite completed in 2.7 milliseconds.

Kuzu:

  • 400 to 1,800 microseconds per query.
  • Requires pip install kuzu (84 megabyte binary).
  • 84 megabytes on disk.
  • Returned exact answers for the queries that ran, but one query errored.

Baseline (LLM via deepseek-v4-flash):

  • 2 to 5 seconds per query.
  • Requires API key and network.
  • Returned “I do not know” for 8 out of 10 queries because the default memory context passed to the LLM was only 1,400 characters.
  • The LLM does not know about your commitments, risks, or opportunities if they are not in the memory file.

Here is the average comparison:

Approach          Average query time
SQLite            10 microseconds
Kuzu              853 microseconds
Baseline (LLM)    3.6 seconds

That is roughly a 360,000x speedup from baseline to SQLite (3.6 seconds versus 10 microseconds). The LLM baseline is not viable for structured queries. It also returns wrong or empty answers for most relational questions and incurs per-query API fees. The baseline LLM remains useful for unstructured reasoning over retrieved context, but it should not be the primary query engine.

Kuzu offers cleaner Cypher queries but is about 85x slower than SQLite at this scale (853 versus 10 microseconds on average) and larger on disk. Kuzu may be worth another look if the dataset explodes in size.

What Won

SQLite is the clear choice for an MVP. It is already present in every Hermes installation. It averages 10 microseconds per query. It stores the entire graph in 56 kilobytes. It can express all 10 benchmark traversals, including the ones that need recursive CTEs and anti-joins. The only downside is verbose SQL for multi-hop patterns, which can be abstracted behind a Python query builder. And let’s be honest, we’re going to be making the agent do it.
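To show what hiding the verbose SQL could look like, here is a minimal sketch of a tiny query builder plus an anti-join. The `path_query` helper, the schema, and the sample rows are hypothetical; this is one possible shape, not the builder that will ship.

```python
import sqlite3


def path_query(rels):
    """Build SQL that follows a fixed chain of relationship types from a start id.

    Only integer aliases are interpolated into the SQL text; the start id and
    the relationship names are bound as parameters.
    """
    if not rels:
        raise ValueError("need at least one relationship type")
    joins = ["FROM edges AS e0"]
    for i in range(1, len(rels)):
        joins.append(f"JOIN edges AS e{i} ON e{i}.src = e{i - 1}.dst")
    where = ["e0.src = ?"] + [f"e{i}.rel = ?" for i in range(len(rels))]
    last = len(rels) - 1
    return f"SELECT e{last}.dst " + " ".join(joins) + " WHERE " + " AND ".join(where)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
CREATE TABLE edges (src INTEGER, dst INTEGER, rel TEXT);
""")
conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", [
    (1, "person", "me"),
    (2, "risk", "missed deadline"),
    (3, "risk", "burnout"),
    (4, "skill", "time blocking"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    (1, 2, "HAS_RISK"),
    (1, 3, "HAS_RISK"),
    (2, 4, "MITIGATED_BY"),
])

# Two-hop traversal: what mitigates my risks?
rels = ["HAS_RISK", "MITIGATED_BY"]
mitigations = conn.execute(path_query(rels), (1, *rels)).fetchall()

# Anti-join: risks that have no MITIGATED_BY edge at all.
UNMITIGATED = """
SELECT r.name FROM entities AS r
WHERE r.type = 'risk'
  AND NOT EXISTS (
      SELECT 1 FROM edges AS e
      WHERE e.src = r.id AND e.rel = 'MITIGATED_BY'
  )
"""
unmitigated = conn.execute(UNMITIGATED).fetchall()
print(mitigations, unmitigated)
```

The anti-join is the kind of query a predictive-intent tool would lean on: things that exist but have nothing attached to them yet are natural candidates for the agent to act on.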

My graph database instinct was wrong at this scale. The data does not need a graph engine. It needs a fast embedded query engine that is already in the stack. SQLite wins. What was right was my intuition that recalling from a stored artifact beats an LLM guessing.

So the grug-brained staff engineer won. We will ship with SQLite because it’s already there and it is fast enough for the current scale.

How to DIY your own tests

It took me longer to write this article than it did to run the benchmark. To try it yourself, I suggest you start with a benchmark runner. You can build one similar to @greynewell’s mcpbr.

The technique I am using is simple. You build a runner and a Docker container for each experiment, each with its own prompt, then tell your agent to run the experiment and dump the logs. You design what is being tested. There’s a bit of an art to choosing between your own data, synthetic data, and datasets built from publicly available information, but once you do it you will have your own custom testing framework to A/B test your agent skills and AI features.
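As a starting point, the orchestration side can be sketched in a few lines of Python. This is an assumption-heavy illustration: the image names (`bench-sqlite`, `bench-kuzu`, `bench-baseline`), the mount paths, and the `dataset.json` file name are hypothetical, and the images are assumed to be built already. Only the `docker run` flags `--rm` and `-v` are real Docker CLI options. It defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess
from pathlib import Path

# One container image per approach under test (hypothetical names).
EXPERIMENTS = ["sqlite", "kuzu", "baseline"]


def docker_cmd(name, dataset, logs):
    """Build the `docker run` command for one experiment."""
    return [
        "docker", "run", "--rm",
        # Same dataset for every runner, mounted read-only.
        "-v", f"{dataset.resolve()}:/data/dataset.json:ro",
        # Each runner dumps its logs into a shared host directory.
        "-v", f"{logs.resolve()}:/logs",
        f"bench-{name}",
    ]


def run_all(dataset, logs, dry_run=True):
    """Run every experiment in turn; with dry_run, only print the commands."""
    commands = [docker_cmd(name, dataset, logs) for name in EXPERIMENTS]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


commands = run_all(Path("dataset.json"), Path("logs"))
```

Running with `dry_run=False` requires Docker and the three images to exist locally. Keeping the dataset mount read-only is a cheap way to make sure no runner can alter the data the others will see.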

I won’t be sharing the benchmark outputs for this one since I used my own personal data, but I encourage you to give it a try.

I will continue working on parsing intents from my notes, and when it is ready there will be a post about it.