<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Jonathan Popham</title>
 <link href="http://jonathanpopham.github.io/" rel="self"/>
 <link href="http://jonathanpopham.github.io"/>
 <updated>2026-04-24T22:09:01+00:00</updated>
 <id>http://jonathanpopham.github.io</id>
 <author>
   <name>Jonathan Popham</name>
   <email>jonathanpopham+github@gmail.com</email>
 </author>

 
 <entry>
   <title>Supermodel Public API Explainer</title>
   <link href="http://jonathanpopham.github.io/engineering/2026/04/20/supermodel-api-explainer"/>
   <updated>2026-04-20T00:00:00+00:00</updated>
   <id>http://jonathanpopham.github.io/engineering/2026/04/20/supermodel-api-explainer</id>
   <content type="html">&lt;p&gt;The Supermodel API has nine public endpoints. Five of them return graphs. Four of them return things you do with graphs.&lt;/p&gt;

&lt;p&gt;That split is deliberate. The graphs are the primitive. The analyses are applications of the primitive. Useful in their own right, but also a demonstration of what becomes easy once the graph exists. If you don’t see the thing you want on the analysis side, the graph is already there. Build it yourself.&lt;/p&gt;

&lt;p&gt;This post walks through all nine. For each one: what it is, why we ship it, and what it’s good for.&lt;/p&gt;

&lt;p&gt;The general shape of every endpoint is the same. You send a zipped repository. You get back a graph or an analysis. Large jobs return a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;202 Accepted&lt;/code&gt; with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Retry-After&lt;/code&gt; and a job handle you can poll; small jobs return the result directly. Authentication is an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;X-Api-Key&lt;/code&gt; header.&lt;/p&gt;
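
&lt;p&gt;Here’s a minimal sketch of that flow against the supermodel endpoint. The headers and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;202&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Retry-After&lt;/code&gt; behavior are as described above; the job-status URL on the last line is illustrative only, since the real handle comes back in the 202 response (see the docs for the exact shape):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Submit the zip; -i keeps the status line so a 202 is visible.
curl -i -X POST &quot;https://api.supermodeltools.com/v1/graphs/supermodel&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -F &quot;file=@/tmp/repo.zip&quot;

# On a 202, wait out Retry-After, then poll the job handle from the response.
# (Illustrative URL; the real handle is returned by the call above.)
sleep 30
curl -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; &quot;https://api.supermodeltools.com/v1/jobs/$JOB_ID&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;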

&lt;p&gt;Every request also takes an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Idempotency-Key&lt;/code&gt; header: a string you choose that lets us deduplicate identical calls. Post the same key twice and you get the same job back instead of running it again. We recommend a content hash, usually the git commit SHA plus the endpoint name. For the Supermodel graph on next.js that looks like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Idempotency-Key: nextjs:supermodel:a0376cf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;If you only call one endpoint, make it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/supermodel&lt;/code&gt;.&lt;/strong&gt; It bundles every primitive below into a single artifact, and it’s what our own internal tools consume by default. The rest of this page is reference for when you want a specific graph on its own. The full spec, including per-endpoint request/response schemas and an interactive playground, lives at &lt;a href=&quot;https://docs.supermodeltools.com&quot;&gt;docs.supermodeltools.com&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-primitives&quot;&gt;The primitives&lt;/h2&gt;

&lt;h3 id=&quot;parse-graph-post-v1graphsparse&quot;&gt;Parse graph: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/parse-graph&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/parse&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; The lowest-level view of your code. We parse every source file with tree-sitter and emit the structural relationships: files contain symbols, symbols declare children, types extend other types, functions reference other functions by name. It’s the AST, flattened into a queryable graph instead of a tree you have to walk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we ship it.&lt;/strong&gt; Every analysis in this API starts here. Parse graphs are what you build on when you want to know not just “what calls what” but “what &lt;em&gt;is&lt;/em&gt; what”: every class, function, type, constant, interface, with its position in the file and its relationship to the symbols around it. If you’re writing your own code intelligence tool, you should not be parsing source files yourself. You should be reading this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Build your own symbol search. Build a custom “find all exports” pass. Layer your own reachability heuristics on top of the declarations we already resolved. Use it as the input to another analysis we haven’t written yet.&lt;/p&gt;
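
&lt;p&gt;As a sketch of the “find all exports” idea: fetch the parse graph once, then filter its nodes with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jq&lt;/code&gt;. The field names below are illustrative, not the actual schema; the real shapes live in the docs.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -s -X POST &quot;https://api.supermodeltools.com/v1/graphs/parse&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; -o parse.json

# Hypothetical field names: list every exported function symbol.
jq '.nodes[] | select(.type == &quot;Function&quot; and .exported == true) | .name' parse.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;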

&lt;hr /&gt;

&lt;h3 id=&quot;dependency-graph-post-v1graphsdependency&quot;&gt;Dependency graph: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/dependency-graph&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/dependency&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; File-level dependencies. Which file imports which file, across every language in the repo. Follows module resolution conventions per language so the edges actually mean something. The graph distinguishes local dependencies (files inside the repo) from external ones (third-party packages: npm, pip, go modules, crates) by node label: a file that imports &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lodash&lt;/code&gt; gets an edge into an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExternalDependency&lt;/code&gt; node; a file that imports &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./utils&lt;/code&gt; gets an edge into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LocalDependency&lt;/code&gt; node pointing at another file. One graph, both worlds, one filter to split them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we ship it.&lt;/strong&gt; It’s the coarsest useful view of a codebase. Most architectural questions (“are these two subsystems actually separate?”, “what does this module depend on?”, “is this package a leaf or a hub?”) are questions about the dependency graph. The reason they feel hard to answer with grep is that they’re not string-matching questions. They’re graph questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Enforce layering rules. Find internal files that everyone depends on (the hubs you can’t change cheaply). Find files that depend on everyone (the integration layers). Filter to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExternalDependency&lt;/code&gt; nodes and you have a code-derived SBOM: which of your files actually pulls in which third-party package, not according to your lockfile, but according to the code. Render your architecture diagram from something real instead of something someone drew in 2023.&lt;/p&gt;
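
&lt;p&gt;A sketch of the code-derived SBOM: pull the dependency graph and keep only the external nodes. The node label is the one described above; the property names in the filter are assumptions, so check the docs for the actual response shape.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -s -X POST &quot;https://api.supermodeltools.com/v1/graphs/dependency&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; -o deps.json

# Hypothetical field names: every third-party package the code actually imports.
jq '.nodes[] | select(.label == &quot;ExternalDependency&quot;) | .name' deps.json | sort -u
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;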

&lt;p&gt;On next.js packages/ (2,308 files), the dependency graph comes back with 4,928 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LocalDependency&lt;/code&gt; nodes and 361 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExternalDependency&lt;/code&gt; nodes, connected by 4,422 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;imports&lt;/code&gt; edges. Roughly a 14:1 ratio of internal to external dependencies. A number that tells you something real about the shape of the codebase.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;call-graph-post-v1graphscall&quot;&gt;Call graph: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/call-graph&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/call&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; Function-level calls. Every resolved callsite from one function to another, across files and modules. Not just “file A imports file B”. Actually “function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;foo&lt;/code&gt; calls function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bar&lt;/code&gt;, on this line, with this resolution.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we ship it.&lt;/strong&gt; The call graph is the thing every AI agent silently wants and doesn’t have. When an agent gets asked to modify a function, the first question it should ask is “who calls this?” The second is “what does this call?” Without a call graph, the agent has to reconstruct the answer with grep, one match at a time, and it will miss the ones that grep can’t see: method dispatch, re-exports, aliased imports. With a call graph, it’s a lookup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Impact analysis. Dead code detection. Refactoring tools that know where the callers are. Pretty much any question that starts with “if I change this function…” is a call-graph query.&lt;/p&gt;
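
&lt;p&gt;The “who calls this?” lookup, sketched against an assumed nodes-and-edges response. The function name is just an example and the field names are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical field names: every resolved callsite whose target is the function in question.
curl -s -X POST &quot;https://api.supermodeltools.com/v1/graphs/call&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; \
  | jq '.edges[] | select(.target == &quot;renderToHTML&quot;) | .source'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;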

&lt;hr /&gt;

&lt;h3 id=&quot;domain-graph-post-v1graphsdomain&quot;&gt;Domain graph: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/domain-graph&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/domain&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; A higher-level grouping of the codebase into &lt;em&gt;domains&lt;/em&gt; and &lt;em&gt;subdomains&lt;/em&gt;, the bounded contexts that would show up on a whiteboard if you asked the team to draw their architecture. The model is loosely based on &lt;a href=&quot;https://c4model.com/&quot;&gt;C4&lt;/a&gt;, which defines four levels: System Context, Container, Component, Code. The four Supermodel graphs line up one-for-one with those levels. The whole codebase graph is the System Context. Domains are Containers, cohesive subsystems that could reasonably live in their own deployable. Subdomains are Components, cohesive groupings within a subsystem. Functions and classes from the parse and call graphs are Code. Same four levels C4 uses, computed from the source instead of drawn in a meeting.&lt;/p&gt;

&lt;p&gt;The computation is a mix of structural and semantic signal. Our graph algorithms produce candidate groupings of nodes that make up the domains and subdomains. An LLM classification pass then names and describes each group, so you get back &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ProjectScaffolding&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OptimizationService&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;domain_3&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;domain_7&lt;/code&gt;. The output is hierarchical: domains contain subdomains, and subdomains contain the functions, classes, and files that belong to them. IDs line up across every graph in the API, so “show me the call graph restricted to the Auth domain” is a filter, not a separate request.&lt;/p&gt;

&lt;p&gt;Inter-domain edges come back with semantic labels inferred from the code: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;coordinates_workflow_with&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;validates_input_for&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;transforms_data_for&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;monitors_health_of&lt;/code&gt;, with a generic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DOMAIN_RELATES&lt;/code&gt; as the fallback. The intent is that the domain graph can be read straight into a diagram or straight into a prompt without a humans-only translation step in between.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we ship it.&lt;/strong&gt; A call graph with 40,000 nodes isn’t legible. A domain graph with five to ten nodes is. The domain graph is what you hand to a human, or to an agent that’s about to write documentation, or to a reviewer who needs to know which subsystem a PR touches. It’s the zoomed-out picture, computed from the zoomed-in one so the two always agree.&lt;/p&gt;

&lt;p&gt;It also solves the drew-it-once-never-updated problem. Most architecture diagrams live in a slide from 2022. This one regenerates from the code on every request, which means the picture of “what this system is” is always the picture of what it currently is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Auto-generated architecture diagrams that stay honest. PR labels that say which domain changed, useful for routing review to the right team. Onboarding documents that don’t go stale because they’re regenerated from the code. A domain filter on every other graph query, so you can ask “show me the call graph &lt;em&gt;for Auth&lt;/em&gt;” without wading through the rest.&lt;/p&gt;
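
&lt;p&gt;As a sketch of that filter, using one of the domain names mentioned above. Only the endpoint and the idea of domain-scoped node lists come from the API as described; the response field names here are assumptions:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -s -X POST &quot;https://api.supermodeltools.com/v1/graphs/domain&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; -o domain.json

# Hypothetical field names: the files grouped under one domain, ready to use
# as a filter over node IDs in any of the other graphs.
jq '.domains[] | select(.name == &quot;OptimizationService&quot;) | .subdomains[].files[]' domain.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;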

&lt;p&gt;Run it on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;packages/&lt;/code&gt; tree of next.js (2,308 files) and you get five domains (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NextRuntime&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ProjectScaffolding&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OptimizationService&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;QualityControl&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DeveloperTools&lt;/code&gt;) split across 11 subdomains, each with a description, a responsibility list, and the three or four files most central to it. Same input as the other graphs, legible architecture diagram out the other side.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;supermodel-graph-post-v1graphssupermodel&quot;&gt;Supermodel graph: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/supermodel-graph&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/supermodel&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; All of the above, bundled. The Supermodel Intermediate Representation (SIR) is a single artifact that contains the parse graph, dependency graph, call graph, and domain graph, cross-referenced and consistent, in one download. This is the endpoint to reach for by default. If you’re not sure which graph you need, you need this one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we ship it.&lt;/strong&gt; Most real tools want more than one of these at once. A dead code detector needs parse + call + entry points. An architecture doc generator needs domain + dependency. Fetching them separately means you pay for four analyses, you stitch them together yourself, and you hope the node IDs line up. The SIR is the version that’s already stitched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Build the tool you actually wanted to build. The SIR is what our own internal analyses consume. If you’re doing anything non-trivial, start here.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-applications&quot;&gt;The applications&lt;/h2&gt;

&lt;p&gt;These are four analyses we ship because we wanted them ourselves, and because each one is a worked example of what the graph is for. You can reproduce any of them from the graph primitives above. We ship them as endpoints because the common cases deserve a one-call answer.&lt;/p&gt;

&lt;h3 id=&quot;dead-code-analysis-post-v1analysisdead-code&quot;&gt;Dead code analysis: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/dead-code-analysis&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/analysis/dead-code&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; A ranked list of candidates for deletion. Symbols that are declared in the parse graph but unreachable in the call graph, starting from framework entry points (pages, controllers, route handlers, test files) and walking outward. Each candidate comes with a probability and a reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s an endpoint, not a recipe.&lt;/strong&gt; Naive dead code detection is a bad experience. A call graph that doesn’t know about Next.js pages will tell you every page is unused. A parser that doesn’t know about barrel re-exports will flag every re-exported type. The endpoint is the version with those edge cases handled, so you get a list that mostly isn’t noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Run it in CI. Attach it to a PR bot that says “you added a function; here are three near it that we think nothing calls.” Feed it into an agent that’s about to write documentation, so the agent documents what’s alive.&lt;/p&gt;
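
&lt;p&gt;A minimal CI sketch, assuming the response exposes a candidate list with a probability per item (the field names here are placeholders; the actual schema is in the docs):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -s -X POST &quot;https://api.supermodeltools.com/v1/analysis/dead-code&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -H &quot;Idempotency-Key: $(git rev-parse --short HEAD):dead-code&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; -o dead-code.json

# Hypothetical field names: fail the build when high-probability candidates show up.
count=$(jq '[.candidates[] | select(.probability &gt; 0.9)] | length' dead-code.json)
if [ &quot;$count&quot; -gt 0 ]; then
  echo &quot;$count likely-dead symbols; see dead-code.json&quot;
  exit 1
fi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;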

&lt;p&gt;We wrote about the benchmark results &lt;a href=&quot;https://jonathanpopham.github.io/blog/dead-code-benchmark/&quot;&gt;here&lt;/a&gt;. The short version: on the repo we measured most carefully, the graph-enabled agent was 30× cheaper in tool calls and 5× better at recall than the same agent with only grep.&lt;/p&gt;

&lt;hr /&gt;

&lt;h3 id=&quot;test-coverage-map-post-v1analysistest-coverage-map&quot;&gt;Test coverage map: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/test-coverage-map&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/analysis/test-coverage-map&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; For every function in the codebase, whether it’s reachable from a test. Not “is this file imported by a test”. Actually, “does a test ever transitively call this function?” Computed from the call graph, with test files as roots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s an endpoint, not a recipe.&lt;/strong&gt; Coverage reports tell you which lines executed. This tells you which functions &lt;em&gt;could&lt;/em&gt; execute from a test entry point. It’s a different question, and it’s more useful when you’re trying to decide what to write tests for, because it’s independent of whether anyone actually ran the suite. A function that’s not reachable from any test has no coverage no matter how high your line-coverage number is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Prioritize where to add tests. Find the functions your critical paths go through that your tests don’t. Pair it with the impact endpoint to find high-blast-radius, low-coverage code: the parts most likely to ship a regression.&lt;/p&gt;
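
&lt;p&gt;One way to turn the map into a to-do list. As above, the field names in the filter are illustrative, not the documented schema:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical field names: the functions no test can reach, as a starting list for new tests.
curl -s -X POST &quot;https://api.supermodeltools.com/v1/analysis/test-coverage-map&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; \
  | jq '.functions[] | select(.testReachable == false) | .name'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;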

&lt;hr /&gt;

&lt;h3 id=&quot;circular-dependency-detection-post-v1analysiscircular-dependencies&quot;&gt;Circular dependency detection: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/circular-dependency-detection&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/analysis/circular-dependencies&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; All cycles in the dependency graph, found with Tarjan’s algorithm. Each cycle comes back as an ordered list of files, so you can see exactly which edges to cut to break it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s an endpoint, not a recipe.&lt;/strong&gt; You could run Tarjan’s yourself on the dependency graph. You probably shouldn’t; it’s a five-line function and we already wrote it. More importantly, circular dependencies are the kind of thing that sneaks in while no one is looking, so this is a CI-check endpoint, not an “I wonder if we have any” endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Fail a build when a new cycle appears. Gate merges on cycle count not increasing. When you do find cycles, treat the output as a list of refactoring targets ranked by how much of the codebase they tangle together.&lt;/p&gt;
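
&lt;p&gt;A sketch of the “cycle count must not increase” gate. The baseline is just a number checked into the repo, and the field name in the count is an assumption:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -s -X POST &quot;https://api.supermodeltools.com/v1/analysis/circular-dependencies&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; -o cycles.json

# Hypothetical field name: compare against a baseline committed to the repo.
new=$(jq '.cycles | length' cycles.json)
old=$(cat .cycle-baseline)
if [ &quot;$new&quot; -gt &quot;$old&quot; ]; then
  echo &quot;cycle count rose from $old to $new&quot;
  exit 1
fi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;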

&lt;hr /&gt;

&lt;h3 id=&quot;impact-analysis-post-v1analysisimpact&quot;&gt;Impact analysis: &lt;a href=&quot;https://docs.supermodeltools.com/api-reference/data-plane/impact-analysis&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/analysis/impact&lt;/code&gt;&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; Blast radius. Given a file or function, the transitive set of callers: everything that could break if you change it. Computed from the reverse call graph, with a depth cap and a grouping by domain so the answer is legible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s an endpoint, not a recipe.&lt;/strong&gt; This is the question an agent should be asking before every non-trivial edit, and it’s the question a reviewer should be asking before every approval. “This change touches 3 files” is meaningless. “This change touches 3 files and 127 callers across 4 domains” is the number you actually needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do with it.&lt;/strong&gt; Attach it to your PR bot. Show the blast radius as a comment on every PR. Let your agent call it before it proposes a change so it knows whether it’s editing a leaf function or the thing under half the codebase. When an agent confidently refactors a function with 127 callers because it looked at 3 files, this is the endpoint that would have stopped it.&lt;/p&gt;
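
&lt;p&gt;A sketch of the PR-bot call. The way to scope the request to a target is documented in the API reference; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;targets&lt;/code&gt; field and the summary filter below are placeholders, not the real parameter names, and the file path is just an example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Placeholder parameter and field names: scope the analysis to one file,
# then post the summary as a PR comment.
curl -s -X POST &quot;https://api.supermodeltools.com/v1/analysis/impact&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; \
  -F &quot;targets=packages/next/src/server/next.ts&quot; \
  | jq '.summary'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;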

&lt;hr /&gt;

&lt;h2 id=&quot;the-point-of-the-split&quot;&gt;The point of the split&lt;/h2&gt;

&lt;p&gt;If you squint, the primitives and the applications do the same thing: they take your code and give you back a structured view of it. The difference is where the judgment happens.&lt;/p&gt;

&lt;p&gt;On the primitive side, we make no decisions for you. We give you the graph as it actually exists in the code. What you do with it is your problem, and that’s the feature. If you disagree with our definition of “dead,” our definition of “blast radius,” our definition of “domain,” you can build your own version out of the graph and skip us entirely on that layer.&lt;/p&gt;

&lt;p&gt;On the application side, we make the obvious decisions so you don’t have to. If you want dead code candidates, you want them with framework entry points handled, barrel re-exports handled, generated directories filtered. You don’t want to re-litigate those choices every time. The application endpoints are the version with the defaults that mostly work.&lt;/p&gt;

&lt;p&gt;Both layers are real and both layers are supported. We’d rather ship a good application endpoint and a good primitive for the cases the application gets wrong than ship one without the other.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;against-nextjs&quot;&gt;Against next.js&lt;/h2&gt;

&lt;p&gt;We pointed all nine endpoints at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;packages/&lt;/code&gt; tree of &lt;a href=&quot;https://github.com/vercel/next.js&quot;&gt;vercel/next.js&lt;/a&gt; (commit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a0376cf&lt;/code&gt;, 2,308 files, 16MB zipped). Same zip, same API key, one call each.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Endpoint&lt;/th&gt;
      &lt;th&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/parse&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;19,445 nodes / 25,264 edges. 6,927 functions, 2,230 classes, 1,686 types, 361 external packages.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/dependency&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;7,976 nodes / 4,422 imports. 4,928 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LocalDependency&lt;/code&gt; : 361 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ExternalDependency&lt;/code&gt; (≈14:1).&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/call&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;3,668 functions / 5,943 resolved calls.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/domain&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;5 domains, 11 subdomains. 1,571 files, 6,999 functions, 2,170 classes assigned.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/graphs/supermodel&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;19,463 nodes / 41,791 edges. All of the above, cross-referenced in one artifact.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/analysis/dead-code&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;1,876 candidates across 11,248 declarations (~80s).&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/analysis/test-coverage-map&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;12.8% test-reachable coverage. 877 tested functions, 5,973 untested.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/analysis/circular-dependencies&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;9 cycles, 321 files involved, 4 high-severity.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /v1/analysis/impact&lt;/code&gt; (targeted)&lt;/td&gt;
      &lt;td&gt;Top dependents for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;packages/next/src/server/next.ts&lt;/code&gt;: 71. Repo-wide top is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;taskfile.js&lt;/code&gt; at 145.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;One note on impact: calling it without targets or a diff asks for a global coupling map, and on a repo the size of next.js the response blows past the payload limit. That’s the correct behavior. The useful question on a large repo isn’t “give me the entire blast radius of every file,” it’s “what breaks if I change &lt;em&gt;this&lt;/em&gt;?” Scope the call with a diff or a target list and it comes back in about a minute and a half.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;try-it&quot;&gt;Try it&lt;/h2&gt;

&lt;p&gt;Every endpoint takes the same input: a zipped repository.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd /path/to/repo
git archive -o /tmp/repo.zip HEAD

curl -X POST &quot;https://api.supermodeltools.com/v1/graphs/supermodel&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -H &quot;Idempotency-Key: $(git rev-parse --short HEAD)&quot; \
  -F &quot;file=@/tmp/repo.zip&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Swap &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/v1/graphs/supermodel&lt;/code&gt; for any of the eight other paths above and the call is identical.&lt;/p&gt;

&lt;p&gt;Full reference lives at &lt;a href=&quot;https://docs.supermodeltools.com&quot;&gt;docs.supermodeltools.com&lt;/a&gt;. The CLI wraps all of this for the live-update workflow:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;npm install -g @supermodeltools/cli
supermodel watch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We maintain the graphs. You build the tools.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Why we built Supermodel</title>
   <link href="http://jonathanpopham.github.io/blog/why-we-built-supermodel/"/>
   <updated>2026-04-17T00:00:00+00:00</updated>
   <id>http://jonathanpopham.github.io/blog/why-we-built-supermodel</id>
   <content type="html">&lt;p&gt;When mathematicians get to know each other, one question they may ask is “do you think that math is invented or discovered?”&lt;/p&gt;

&lt;p&gt;The answer gives the asker insight into how the other mathematician thinks.&lt;/p&gt;

&lt;p&gt;If they say mathematics is invented, they are saying that we begin with a blank slate and make the rules up as we go along.&lt;/p&gt;

&lt;p&gt;If they say that mathematics is discovered, they are saying that every mathematical truth already exists, whether or not anyone has found it yet.&lt;/p&gt;

&lt;p&gt;The question is a matter of opinion. Some will give a balanced reply like, “We discover math by inventing mathematical techniques”.&lt;/p&gt;

&lt;p&gt;In software engineering, we could ask a similar question, “do you think that software is invented or discovered?”&lt;/p&gt;

&lt;p&gt;Until recently, the same question was boring to ask software engineers. Every piece of software that was ever written was invented by a person.&lt;/p&gt;

&lt;p&gt;That isn’t true anymore. There’s an alien living in your computer now, and it writes programs that didn’t exist before you asked for them.&lt;/p&gt;

&lt;p&gt;So the question is suddenly interesting. Did you invent your product, or did you discover it?&lt;/p&gt;

&lt;p&gt;A clever engineer might answer the way a clever mathematician does: I discovered my software by inventing new techniques. Fine. But that dodges the thing that actually matters. If our job now involves routinely finding fully-formed worlds on our computers, then our job also now involves understanding what we ship. You can’t be responsible for code you don’t understand. And to understand something, you have to model it.&lt;/p&gt;

&lt;p&gt;That’s what we set out to do. We built Supermodel to make models of agent-written software, so that the humans and agents working on it can actually know what’s there.&lt;/p&gt;

&lt;h2 id=&quot;the-model-has-to-come-from-the-code-not-the-llm&quot;&gt;The model has to come from the code, not the LLM&lt;/h2&gt;

&lt;p&gt;There’s an obvious shortcut: have the LLM write the program, then ask the LLM to explain it. This works on toy problems. It falls apart on real ones.&lt;/p&gt;

&lt;p&gt;The reason is simple. Programs are deterministic. Every symbol means exactly one thing. Every call goes to exactly one place. LLMs are probabilistic. They guess, confidently, based on what similar code usually looks like. A good guess about a real system is still a guess. When the system gets big enough, the guesses compound and the explanation drifts from the code.&lt;/p&gt;

&lt;p&gt;If you want a source of truth about a program, you can’t ask something that hallucinates. You have to read the program itself.&lt;/p&gt;

&lt;p&gt;At the logical level, every program is a graph. Symbols relate to each other in a parse tree. Functions call each other in a call graph. Modules depend on each other in a dependency graph. These aren’t metaphors; they’re the actual structure the compiler sees. A codegraph is all of them together, filtered and labeled so a human or an agent can reason about the system without drowning in it.&lt;/p&gt;

&lt;p&gt;Engineers have always had codegraphs in their heads. The point of pride used to be keeping the whole thing up there. That works when you’re alone. It breaks the moment you need to collaborate, which is why onboarding a new engineer to a mature codebase takes months.&lt;/p&gt;

&lt;p&gt;You are now collaborating with an alien every day. The onboarding problem is no longer monthly. It’s every context window.&lt;/p&gt;

&lt;h2 id=&quot;why-agents-need-this-too&quot;&gt;Why agents need this too&lt;/h2&gt;

&lt;p&gt;You might reasonably ask: if the agent can write code without a graph model, why does it need one to work on code that’s already been written?&lt;/p&gt;

&lt;p&gt;Because writing from scratch and modifying an existing system are different problems. When an agent generates a new function, probabilistic reasoning is an asset. The agent is drawing on everything it’s seen. When an agent edits a real codebase, probabilistic reasoning is a liability. It needs to know, not guess, where a symbol is defined, what calls it, what breaks if it changes. Guessing at the structure of code that already exists is how agents produce confident, plausible, wrong patches.&lt;/p&gt;

&lt;p&gt;An agent grounded in a real codegraph stops guessing about the parts it can look up. We bet that software writers (human or computer) will always need models, and that the correct type of model for software is a graph, and the correct graph model is the one that we have created. We believe that graph models are infrastructure.&lt;/p&gt;

&lt;h2 id=&quot;try-it&quot;&gt;Try it&lt;/h2&gt;

&lt;p&gt;Our goal is to model any program in any language.&lt;/p&gt;

&lt;p&gt;We’ve distilled our effort into this CLI tool:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;npm &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-g&lt;/span&gt; @supermodeltools/cli
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Go to your project and run:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;supermodel watch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You’ll get a live graph model of your code that you or your agent can query. More at &lt;a href=&quot;https://supermodeltools.com&quot;&gt;supermodeltools.com&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>What Dead Code Taught Us About Building Tools for AI Agents</title>
   <link href="http://jonathanpopham.github.io/blog/dead-code-benchmark/"/>
   <updated>2026-03-30T00:00:00+00:00</updated>
   <id>http://jonathanpopham.github.io/blog/what-dead-code-taught-us-about-building-tools-for-ai-agents</id>
   <content type="html">&lt;p&gt;We set out to build a code visualization tool. AI can write code faster than you can review it, and we wanted to give developers a way to keep up: interactive architecture graphs, real-time structure views, a shared picture of what’s actually happening in the codebase.&lt;/p&gt;

&lt;p&gt;We quickly realized that to build any type of precise visualization, documentation, or code review, you first need a good graph. The graph is the precursor. And when we looked around, we saw the same thing everywhere: every code review tool, every documentation generator, every AI coding assistant that needs to understand codebase structure ends up building its own parser, its own import resolver, its own symbol graph. It’s the same foundational work rebuilt independently by dozens of teams. Nobody had put together one comprehensive set of graph primitives that’s well-maintained and available for anyone to build on top of.&lt;/p&gt;

&lt;p&gt;So we decided to do that. We think code graphs are a core primitive, especially now, as the industry moves toward software factories where agents need structural understanding of what they’re working on. Our focus is on maintaining precise graphs and parsing so that everyone building on top doesn’t have to duplicate that effort.&lt;/p&gt;

&lt;p&gt;This post is about dead code detection, the first tool we built on our own graph primitives, and the one we’ve benchmarked most extensively. We tested it across 14 real-world repositories, from 449-star libraries to 138K-star monorepos like next.js. The result: &lt;strong&gt;156x cheaper, 11x faster, and 2x better performance than Claude Opus 4.6 alone.&lt;/strong&gt; 94.1% average F1 with 100% precision across every task. But the thesis is bigger than dead code. We aim to make the following case: &lt;strong&gt;graphs are a primitive for code factories.&lt;/strong&gt; This dead code removal tool is an example of what can be built with our public API. If you have your own interpretation of how this problem or another can be better solved with graph primitives, we are happy to provide you with the raw materials to do so.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-dead-code-problem&quot;&gt;The Dead Code Problem&lt;/h2&gt;

&lt;p&gt;We discovered the dead code problem by accident. We gave an agent a directory that had living code and dead code in it and told it to make documentation. It documented dead features as living.&lt;/p&gt;

&lt;p&gt;With vibe-coded software, especially after multiple refactors, it’s very likely that dead code gets left behind, clogging the context. Engineers who code by hand know the frustration of editing a method and seeing nothing change, only to discover that the implementation has drifted from the spec and the method is dead. AI agents don’t have that intuition. They see every function as equally real.&lt;/p&gt;

&lt;p&gt;Dead code clogs context windows, confuses agents, and wastes the most expensive resource in AI-powered development: tokens spent reasoning about code that doesn’t matter.&lt;/p&gt;

&lt;p&gt;Good prompting comes down to signal: give the model a high volume of high-signal context and eliminate as much noise as possible. If we could identify and remove dead code before the agent sees it, we could dramatically improve the quality of every downstream task: documentation, code review, refactoring, feature development.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;from-graphs-to-dead-code-candidates&quot;&gt;From Graphs to Dead Code Candidates&lt;/h2&gt;

&lt;p&gt;Our insight was that with a well-made call graph and a well-made dependency graph, in many cases we could discover “dead code candidates.” The naive rule would be “anything that is not imported or not called is dead.” However, with generated code patterns there may be things that are not called until the system is built. Additionally, framework entry points (Express route handlers, Next.js pages, NestJS controllers) are never “called” by your code; they’re invoked at runtime. Services gated by an API may have code that appears dead but isn’t, since the client could be on the other side of a network boundary: a REST handler with zero internal callers, a webhook endpoint waiting for external events, a plugin loaded by convention rather than by import.&lt;/p&gt;

&lt;p&gt;With these constraints in mind, though, it’s possible to build an agent-enabled system that begins with a set of items that appear to be dead, ranked by probability. Such a system can also self-improve as it accumulates knowledge about project structure: projects that follow a generator pattern, for instance, typically use conventional directory names like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target/&lt;/code&gt;. The result is a system that produces increasingly likely dead code candidates, with the caveat that there will be false positives to sort through.&lt;/p&gt;

&lt;p&gt;Still, this greatly reduces the context load on an LLM. On smaller projects, an LLM can effectively trace the entire execution path inside of the context window. On larger projects this becomes increasingly infeasible. By using graph analysis primitives, we can eliminate a huge chunk of known noise. After that we can use agents to sort through candidates to remove false positives. Finally, over time we can learn how project structures and design patterns create false positives to make a more refined system that further reduces the false positives the agent needs to sort through.&lt;/p&gt;

&lt;p&gt;The cumulative effect of this process is that we can build CI pipelines and refactoring tools that will reduce dead code with increasing accuracy and precision. The final outcome once the dead code is removed is less wasted context, fewer agent errors, and more work done.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;why-naive-reachability-isnt-enough&quot;&gt;Why Naive Reachability Isn’t Enough&lt;/h2&gt;

&lt;p&gt;Static analysis can trace imports and function calls. What it can’t easily see are the boundaries of indirection that make code &lt;em&gt;appear&lt;/em&gt; dead when it isn’t:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework entry points.&lt;/strong&gt; A Next.js &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;page.tsx&lt;/code&gt;, a NestJS &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@Controller()&lt;/code&gt;, an Express route handler. None of these are “called” by your code. They’re invoked by the framework at runtime. A naive dead code detector would flag every API endpoint as unused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event-driven and plugin architectures.&lt;/strong&gt; Webhook handlers, message queue consumers, dynamically loaded plugins. All registered through patterns that static analysis struggles to trace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API boundaries.&lt;/strong&gt; When a service exposes functions through a REST or GraphQL API, the callers live on the other side of a network boundary. The server-side handler has zero internal callers, but it’s the most critical code in the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generated code patterns.&lt;/strong&gt; Code generators (ORMs, gRPC stubs, GraphQL codegen) produce symbols that aren’t called until the rest of the system is wired up. These often live in conventionally-named directories like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generated/&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target/&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__generated__/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Re-exports and type-level usage.&lt;/strong&gt; A type that’s re-exported through a barrel file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index.ts&lt;/code&gt;), or a constant used only in type annotations. These are alive but invisible to call-graph-only analysis.&lt;/p&gt;

&lt;p&gt;These aren’t edge cases. In a typical production codebase, they represent 30-60% of all exported symbols. Flag them all as dead and you’ve built a tool nobody trusts.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;our-approach-probabilistic-candidates--agent-verification&quot;&gt;Our Approach: Probabilistic Candidates + Agent Verification&lt;/h2&gt;

&lt;p&gt;Instead of trying to build a perfect static analyzer (an impossible task), we designed a system that works &lt;em&gt;with&lt;/em&gt; AI agents rather than replacing them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Graph Analysis.&lt;/strong&gt; Parse the codebase with tree-sitter. Build the call graph and dependency graph. Run BFS reachability from identified entry points (framework conventions, main files, test files). Everything unreachable becomes a candidate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Probabilistic Ranking.&lt;/strong&gt; Not all candidates are equally likely to be dead. We rank by signals: Is it in a generated directory? Does it follow a framework naming convention? Is it a type re-export? How deep is it in the import chain? This produces a ranked list of candidates, from “almost certainly dead” to “suspicious but uncertain.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Agent Verification.&lt;/strong&gt; Hand the ranked candidates to an AI agent. The agent can read surrounding code, check for dynamic usage patterns, and apply judgment that static analysis can’t. The key insight: &lt;strong&gt;the agent’s job is now filtering a short list, not searching an entire codebase.&lt;/strong&gt; This is dramatically more tractable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Learn and Refine.&lt;/strong&gt; Track which candidates turn out to be false positives. Learn that projects using Next.js have &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;page.tsx&lt;/code&gt; files that look dead but aren’t. Learn that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__mocks__/&lt;/code&gt; directories are test infrastructure. Feed this back into the ranking model.&lt;/p&gt;

&lt;p&gt;The cumulative effect: each iteration produces fewer false positives for the agent to sort through, the verification gets faster and cheaper, and the system builds institutional knowledge about project patterns.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;benchmarking-how-we-measured&quot;&gt;Benchmarking: How We Measured&lt;/h2&gt;

&lt;p&gt;We used &lt;a href=&quot;https://github.com/greynewell/mcpbr&quot;&gt;mcpbr&lt;/a&gt; (Model Context Protocol Benchmark Runner), built by &lt;a href=&quot;https://github.com/greynewell&quot;&gt;Grey Newell&lt;/a&gt;, to run controlled experiments. Grey also contributed critical fixes to the Supermodel API — including confidence calibration, OOM prevention, dead export detection, and the StreamReader fix that made baseline evaluation reliable — and built the &lt;a href=&quot;https://github.com/supermodeltools/codegraph-bench&quot;&gt;codegraph-bench&lt;/a&gt; code navigation benchmark. The setup:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Model&lt;/strong&gt;: Claude Opus 4.6 via the Anthropic API&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Agent harness&lt;/strong&gt;: Claude Code&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Two conditions&lt;/strong&gt;: (A) Agent with Supermodel MCP server providing graph analysis, (B) Baseline agent with only grep, glob, and file reads&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Same prompt, same tools&lt;/strong&gt; (minus the MCP server), same evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;ground-truth-how-do-you-know-whats-actually-dead&quot;&gt;Ground Truth: How Do You Know What’s Actually Dead?&lt;/h3&gt;

&lt;p&gt;This is the hardest part of benchmarking dead code detection. You need to know, with certainty, which symbols in a codebase are dead. We used two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synthetic codebases.&lt;/strong&gt; We built a 35-file TypeScript Express app and intentionally planted 102 dead code items: legacy integrations, deprecated auth methods, feature flags that were never cleaned up, replaced utility functions. We know exactly what’s dead because we put it there. This is useful for development but doesn’t reflect real-world complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real pull requests from open-source projects.&lt;/strong&gt; This is where the benchmark gets interesting. We searched GitHub for merged PRs whose commit messages and descriptions explicitly mention removing dead code, unused functions, or deprecated features. The logic: if a developer identified code as dead, removed it in a PR, the tests still pass, and the PR was approved by reviewers and merged, that’s confirmed dead code.&lt;/p&gt;

&lt;p&gt;For each PR, we extracted ground truth by parsing the diff: every exported function, class, interface, constant, or type that was &lt;em&gt;deleted&lt;/em&gt; (not moved or renamed) became a ground truth item. The agent’s job is to identify these same items by analyzing the codebase at the commit &lt;em&gt;before&lt;/em&gt; the PR, the state where the dead code still exists.&lt;/p&gt;

&lt;p&gt;This methodology has a key strength: it’s grounded in real engineering decisions, not synthetic judgment calls. A human developer, with full context of the project, decided this code was dead. We’re asking: can an AI agent reach the same conclusion?&lt;/p&gt;

&lt;p&gt;We tested against PRs from 14 repositories spanning small libraries to massive monorepos:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Repository&lt;/th&gt;
      &lt;th&gt;Stars&lt;/th&gt;
      &lt;th&gt;Task&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;track-your-regions&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
      &lt;td&gt;tyr_pr258&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;podman-desktop&lt;/td&gt;
      &lt;td&gt;16K&lt;/td&gt;
      &lt;td&gt;podman_pr16084&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gemini-cli&lt;/td&gt;
      &lt;td&gt;7K&lt;/td&gt;
      &lt;td&gt;gemini_cli_pr18681&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;jsLPSolver&lt;/td&gt;
      &lt;td&gt;449&lt;/td&gt;
      &lt;td&gt;jslpsolver_pr159&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;strapi&lt;/td&gt;
      &lt;td&gt;71.7K&lt;/td&gt;
      &lt;td&gt;strapi_pr24327&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;mimir&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
      &lt;td&gt;mimir_pr3613&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;opentelemetry-js&lt;/td&gt;
      &lt;td&gt;3.3K&lt;/td&gt;
      &lt;td&gt;otel_js_pr5444&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;TanStack/router&lt;/td&gt;
      &lt;td&gt;14K&lt;/td&gt;
      &lt;td&gt;tanstack_router_pr6735&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;latitude-llm&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
      &lt;td&gt;latitude_pr2300&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;storybook&lt;/td&gt;
      &lt;td&gt;89.6K&lt;/td&gt;
      &lt;td&gt;storybook_pr34168&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Maskbook&lt;/td&gt;
      &lt;td&gt;1.6K&lt;/td&gt;
      &lt;td&gt;maskbook_pr12361&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;directus&lt;/td&gt;
      &lt;td&gt;34.6K&lt;/td&gt;
      &lt;td&gt;directus_pr26311&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;cal.com&lt;/td&gt;
      &lt;td&gt;40.9K&lt;/td&gt;
      &lt;td&gt;calcom_pr26222&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;next.js&lt;/td&gt;
      &lt;td&gt;138K&lt;/td&gt;
      &lt;td&gt;nextjs_pr87149&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Across 60+ benchmark runs, we evaluated both agents on precision (what fraction of reported items are actually dead), recall (what fraction of actually dead items were found), and F1 score.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;h3 id=&quot;the-headline-156x-cheaper-11x-faster-2x-better&quot;&gt;The Headline: 156x Cheaper, 11x Faster, 2x Better&lt;/h3&gt;

&lt;p&gt;Let’s start with the numbers that matter. Across 14 real-world tasks, each drawn from a merged PR in an open-source repository, the graph-enhanced agent using the Supermodel MCP server dominated the baseline agent on every dimension:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Metric&lt;/th&gt;
      &lt;th&gt;MCP (Graph) Agent&lt;/th&gt;
      &lt;th&gt;Baseline Agent&lt;/th&gt;
      &lt;th&gt;Improvement&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Avg F1&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;94.1%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;52.0%&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;2x&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Avg Precision&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;varies&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Perfect&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Avg Recall&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;90%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;varies&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Total Cost&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;$1.40&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;$219&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;156x cheaper&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Total Runtime&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;28 min&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;306 min&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;11x faster&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Total Tool Calls&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;28&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;4,079&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;146x fewer&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Avg Tool Calls/Task&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;291&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Head-to-Head Wins&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;11&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Ties&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;100% precision across all 14 tasks. Zero false positives. Every single item the graph agent reported was confirmed dead code.&lt;/p&gt;

&lt;p&gt;The baseline agent spent 4,079 tool calls grepping through codebases, trying to reconstruct call graphs at runtime. The graph agent made 28 tool calls total, 2 per task on average. It read the pre-computed analysis, reported the candidates, and was done. &lt;strong&gt;The graph pre-computes the expensive work, so the agent doesn’t have to.&lt;/strong&gt;&lt;/p&gt;

&lt;h3 id=&quot;per-task-breakdown&quot;&gt;Per-Task Breakdown&lt;/h3&gt;

&lt;p&gt;Here’s every task, sorted by the gap between MCP and baseline performance:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Task&lt;/th&gt;
      &lt;th&gt;Repo&lt;/th&gt;
      &lt;th&gt;Stars&lt;/th&gt;
      &lt;th&gt;MCP F1&lt;/th&gt;
      &lt;th&gt;Base F1&lt;/th&gt;
      &lt;th&gt;MCP P&lt;/th&gt;
      &lt;th&gt;MCP R&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;storybook_pr34168&lt;/td&gt;
      &lt;td&gt;storybook&lt;/td&gt;
      &lt;td&gt;89.6K&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;0%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;otel_js_pr5444&lt;/td&gt;
      &lt;td&gt;opentelemetry-js&lt;/td&gt;
      &lt;td&gt;3.3K&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;17.6%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;tanstack_router_pr6735&lt;/td&gt;
      &lt;td&gt;TanStack/router&lt;/td&gt;
      &lt;td&gt;14K&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;12%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;directus_pr26311&lt;/td&gt;
      &lt;td&gt;directus&lt;/td&gt;
      &lt;td&gt;34.6K&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;14.3%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;nextjs_pr87149&lt;/td&gt;
      &lt;td&gt;next.js&lt;/td&gt;
      &lt;td&gt;138K&lt;/td&gt;
      &lt;td&gt;88.9%&lt;/td&gt;
      &lt;td&gt;CRASH&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;80%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;latitude_pr2300&lt;/td&gt;
      &lt;td&gt;latitude-llm&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
      &lt;td&gt;92.3%&lt;/td&gt;
      &lt;td&gt;35.3%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;86%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;calcom_pr26222&lt;/td&gt;
      &lt;td&gt;cal.com&lt;/td&gt;
      &lt;td&gt;40.9K&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;57.1%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;gemini_cli_pr18681&lt;/td&gt;
      &lt;td&gt;gemini-cli&lt;/td&gt;
      &lt;td&gt;7K&lt;/td&gt;
      &lt;td&gt;80%&lt;/td&gt;
      &lt;td&gt;42.9%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;67%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;podman_pr16084&lt;/td&gt;
      &lt;td&gt;podman-desktop&lt;/td&gt;
      &lt;td&gt;16K&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;67.7%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;maskbook_pr12361&lt;/td&gt;
      &lt;td&gt;Maskbook&lt;/td&gt;
      &lt;td&gt;1.6K&lt;/td&gt;
      &lt;td&gt;81%&lt;/td&gt;
      &lt;td&gt;68.4%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;68%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;tyr_pr258&lt;/td&gt;
      &lt;td&gt;track-your-regions&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
      &lt;td&gt;97.6%&lt;/td&gt;
      &lt;td&gt;81.6%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;95%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;strapi_pr24327&lt;/td&gt;
      &lt;td&gt;strapi&lt;/td&gt;
      &lt;td&gt;71.7K&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;mimir_pr3613&lt;/td&gt;
      &lt;td&gt;mimir&lt;/td&gt;
      &lt;td&gt;–&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;jslpsolver_pr159&lt;/td&gt;
      &lt;td&gt;jsLPSolver&lt;/td&gt;
      &lt;td&gt;449&lt;/td&gt;
      &lt;td&gt;78.3%&lt;/td&gt;
      &lt;td&gt;78.6%&lt;/td&gt;
      &lt;td&gt;100%&lt;/td&gt;
      &lt;td&gt;64%&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The storybook result is striking: 89.6K stars, massive monorepo, and the baseline agent couldn’t find a single confirmed dead code item. The graph agent found all of them. The same pattern plays out across OpenTelemetry JS (17.6% vs 100%), TanStack Router (12% vs 100%), and Directus (14.3% vs 100%). On next.js, the largest repo in the benchmark at 138K stars, the baseline agent crashed entirely. The graph agent scored 88.9% F1.&lt;/p&gt;

&lt;p&gt;The three closest results (strapi, mimir, jslpsolver) are instructive. On strapi and mimir, both agents achieved perfect scores – these tasks had clean, well-scoped dead code that even grep-based search could find. On jslpsolver, the baseline agent actually edged out the graph agent by 0.3 percentage points on F1, the only task where that happened. The graph agent’s 100% precision (vs the baseline’s lower precision) shows the tradeoff: the graph agent is more conservative, sometimes missing items the baseline stumbles onto, but it never reports false positives.&lt;/p&gt;

&lt;h3 id=&quot;what-changed-from-10-f1-to-94-f1&quot;&gt;What Changed: From 10% F1 to 94% F1&lt;/h3&gt;

&lt;p&gt;If you’ve been following our benchmarking journey, you’ll notice these numbers look dramatically different from our earlier results. In our February and early March runs, the graph agent achieved high recall but terrible precision – single-digit percentages, with hundreds or thousands of false positives per task. What happened?&lt;/p&gt;

&lt;p&gt;Three things changed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Parser improvements.&lt;/strong&gt; Barrel re-export filtering, cross-package import resolution, class rescue patterns, and seven new pipeline phases dramatically reduced the candidate list. Fewer false candidates means fewer false positives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. MCP server instead of analysis dump.&lt;/strong&gt; Previously, we pre-computed a large JSON analysis file and handed it to the agent. Files with 6,000+ candidates exceeded tool output limits, causing truncation and errors. The MCP server delivers candidates through a structured API call, solving the file size wall entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Better agent prompting.&lt;/strong&gt; We stopped asking the agent to verify candidates with grep (which was less accurate than the graph analysis it was checking) and instead told the agent to trust the graph analysis. This restored recall to expected levels and, combined with the improved parser, achieved the precision breakthrough.&lt;/p&gt;

&lt;p&gt;The cumulative effect: the same architectural approach – graph-based candidate generation plus agent verification – went from promising-but-rough to production-grade. The thesis was right. The implementation needed iteration.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;failure-modes-we-discovered-and-fixed&quot;&gt;Failure Modes We Discovered (and Fixed)&lt;/h2&gt;

&lt;h3 id=&quot;1-the-file-size-wall-fixed&quot;&gt;1. The File Size Wall (Fixed)&lt;/h3&gt;

&lt;p&gt;Large analysis files run straight into tool output limits. A 6,000-candidate analysis blows past the 25K-token cap, so the agent either gets a truncated view or errors out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Moving to the MCP server architecture eliminated this entirely. Instead of dumping a massive JSON file, the agent makes a structured API call and gets back a clean candidate list. This was one of the key changes that took us from single-digit precision to 100%.&lt;/p&gt;

&lt;h3 id=&quot;2-api-recall-gaps-mostly-fixed&quot;&gt;2. API Recall Gaps (Mostly Fixed)&lt;/h3&gt;

&lt;p&gt;Sometimes the Supermodel parser misses ground truth items entirely. In earlier benchmarks, the Logto task found 0 of 8 ground truth items in the analysis. No amount of agent intelligence can find what the analysis doesn’t contain.&lt;/p&gt;

&lt;p&gt;Root causes we identified and fixed: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export default&lt;/code&gt; not tracked, type re-exports (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export type { X } from&lt;/code&gt;) missed, test file imports not scanned, barrel re-export filtering. These parser improvements, combined with the MCP server delivery, are why recall went from 85% to 90% average across a larger and harder set of tasks.&lt;/p&gt;

&lt;h3 id=&quot;3-agent-verification-can-hurt-performance-fixed&quot;&gt;3. Agent Verification Can Hurt Performance (Fixed)&lt;/h3&gt;

&lt;p&gt;This one surprised us. In an earlier benchmark run, we instructed the agent to verify each candidate by grepping for the symbol name across the codebase. The idea was sound: if a symbol appears in other files, it’s probably alive.&lt;/p&gt;

&lt;p&gt;The result: &lt;strong&gt;recall dropped from 95.5% to 40%&lt;/strong&gt; on our best-performing task (tyr_pr258). The agent’s grep verification was filtering genuinely dead code out of the report.&lt;/p&gt;

&lt;p&gt;Why? The grep used word-boundary matching (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grep -w&lt;/code&gt;). A function named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hasRole&lt;/code&gt; would match the word &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hasRole&lt;/code&gt; appearing in a comment, a string literal, or a completely unrelated variable name in another file. The agent would see the match and mark the function as “alive.” A false negative introduced by the verification step.&lt;/p&gt;
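
&lt;p&gt;The failure is easy to reproduce in isolation. A minimal sketch, with a made-up file and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hasRole&lt;/code&gt; standing in for any candidate symbol:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# A file that merely mentions the symbol, without ever calling it
cat &gt; /tmp/notes.ts &lt;&lt;'EOF'
// TODO: revisit hasRole once permissions are reworked
const message = &quot;hasRole check skipped&quot;;
EOF

grep -rnw hasRole /tmp/notes.ts
# Both lines match on the whole word, so a grep-based verifier concludes the
# function is alive, even though nothing in the codebase actually calls it.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;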

&lt;p&gt;The irony: the static analyzer had already performed proper call graph and dependency analysis to identify these candidates. The agent’s grep check was a &lt;em&gt;less accurate&lt;/em&gt; version of what the analyzer already did. By asking the agent to verify the analysis, we made it worse.&lt;/p&gt;

&lt;p&gt;The fix was simple: tell the agent to trust the analysis and pass through all candidates without grep verification. The lesson: &lt;strong&gt;don’t let a less precise tool override a more precise one.&lt;/strong&gt; Graph-based reachability analysis is strictly more accurate than grep-based name matching for determining whether code is alive.&lt;/p&gt;

&lt;h3 id=&quot;4-agent-non-determinism-mitigated&quot;&gt;4. Agent Non-Determinism (Mitigated)&lt;/h3&gt;

&lt;p&gt;Same task, same config, different results. One run finds 3 true positives; the rerun finds 0. This is an inherent property of LLM-based agents.&lt;/p&gt;

&lt;p&gt;The mitigation that worked: reduce the agent’s degrees of freedom. With the MCP server delivering a short, well-ranked candidate list via a structured API, the agent has almost no room to go off-track. The result is 2 tool calls per task on average, and highly reproducible outcomes. When the agent’s job is “read this list and report it,” non-determinism effectively disappears.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-scaling-insight&quot;&gt;The Scaling Insight&lt;/h2&gt;

&lt;p&gt;This is the finding we keep coming back to. Across the 14 tasks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Total MCP cost: $1.40.&lt;/strong&gt; Total baseline cost: $219. That’s &lt;strong&gt;156x cheaper&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Total MCP runtime: 28 minutes.&lt;/strong&gt; Total baseline runtime: 306 minutes. That’s &lt;strong&gt;11x faster&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;MCP tool calls: 28 total&lt;/strong&gt; (2 per task average). Baseline tool calls: 4,079 total (291 per task average).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The economics are striking. At 2 tool calls per task, the graph agent’s cost is nearly constant regardless of codebase size. The baseline agent’s cost scales with the size and complexity of the repo – 291 tool calls on average, but the variance is enormous. On next.js (138K stars), the baseline agent crashed before producing a result, consuming tokens the entire way.&lt;/p&gt;

&lt;p&gt;As codebases grow:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Baseline cost explodes.&lt;/strong&gt; More files means more tool calls spent building a mental model of the codebase.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Graph cost stays flat.&lt;/strong&gt; The agent makes an MCP call, gets structured candidates, and reports them. Two tool calls whether the repo has 50 files or 50,000.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Baseline quality degrades.&lt;/strong&gt; On the four largest repos (storybook, next.js, strapi, directus – all 34K+ stars), the baseline averaged 28.4% F1. The graph agent averaged 97.2% F1.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Graph quality stays high.&lt;/strong&gt; 90% average recall and 100% precision regardless of repo size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The graph absorbs the complexity that would otherwise land on the agent. This is the fundamental value proposition, and it applies to any tool built on graph primitives, not just dead code detection.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;lessons-for-building-ai-powered-code-tools&quot;&gt;Lessons for Building AI-Powered Code Tools&lt;/h2&gt;

&lt;h3 id=&quot;1-context-engineering-matters-more-than-model-capability&quot;&gt;1. Context engineering matters more than model capability&lt;/h3&gt;

&lt;p&gt;Same model, same tools, different input structure: 2x better F1, 156x cheaper. The model wasn’t the bottleneck. The signal-to-noise ratio of its input was.&lt;/p&gt;

&lt;p&gt;This is the core lesson. &lt;strong&gt;Good prompting is high-signal prompting.&lt;/strong&gt; The best thing you can do for an AI agent isn’t give it a smarter model. It’s give it pre-computed, structured, relevant context and eliminate the noise.&lt;/p&gt;

&lt;h3 id=&quot;2-pre-compute-what-you-can-delegate-judgment-to-the-agent&quot;&gt;2. Pre-compute what you can, delegate judgment to the agent&lt;/h3&gt;

&lt;p&gt;Static analysis is good at exhaustive enumeration. AI agents are good at judgment calls. The worst outcome is making the agent do both: enumerate &lt;em&gt;and&lt;/em&gt; judge. That’s 291 tool calls per task and $219 total for worse results.&lt;/p&gt;

&lt;p&gt;The best outcome is a pipeline: graphs enumerate candidates, agents verify them. Each component does what it’s best at. Two tool calls per task and $1.40 total for better results.&lt;/p&gt;
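
&lt;p&gt;The shape of that pipeline is small enough to sketch. The call mirrors the one under “Try It” below (and assumes you’ve already zipped the repo as shown there); the prompt wording is illustrative, and large repos return a job handle to poll rather than the result directly:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# 1. Enumerate: static analysis builds the candidate list once, up front.
curl -s -X POST &quot;https://api.supermodeltools.com/v1/analysis/dead-code&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -H &quot;Idempotency-Key: $(git rev-parse --short HEAD)&quot; \
  -F &quot;file=@/tmp/repo.zip&quot; &gt; candidates.json

# 2. Judge: the agent reads the short, ranked list and makes the call.
claude -p &quot;Read candidates.json and report which entries are safe to delete, with a one-line reason each.&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;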

&lt;h3 id=&quot;3-precision-is-achievable-not-just-aspirational&quot;&gt;3. Precision is achievable, not just aspirational&lt;/h3&gt;

&lt;p&gt;In our earlier benchmarks, we wrote that “precision is the frontier” – we were achieving high recall but single-digit precision on real codebases. We believed precision was solvable through better ranking and filtering. We were right.&lt;/p&gt;

&lt;p&gt;100% precision across 14 tasks. Zero false positives. The combination of parser improvements, MCP server delivery, and better agent prompting solved a problem we’d been publicly struggling with for months. The lesson: iterate on the system, not the model.&lt;/p&gt;

&lt;h3 id=&quot;4-real-world-codebases-are-dramatically-harder-than-synthetic-ones--but-solvable&quot;&gt;4. Real-world codebases are dramatically harder than synthetic ones – but solvable&lt;/h3&gt;

&lt;p&gt;On our synthetic benchmark, we hit 95% F1. On real-world codebases in earlier runs, F1 was in the single digits despite high recall. We were worried this gap might be fundamental.&lt;/p&gt;

&lt;p&gt;It wasn’t. With the right system improvements, real-world F1 rose to 94.1% average. The four largest repos in our benchmark (storybook at 89.6K stars, next.js at 138K, strapi at 71.7K, directus at 34.6K) averaged 97.2% F1. Repo size is no longer the limiting factor.&lt;/p&gt;

&lt;h3 id=&quot;5-the-system-improves-iteratively&quot;&gt;5. The system improves iteratively&lt;/h3&gt;

&lt;p&gt;Every benchmark run taught us something:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;jsLPSolver taught us that well-organized small repos favor grep-based search&lt;/li&gt;
  &lt;li&gt;Maskbook taught us about the file size wall&lt;/li&gt;
  &lt;li&gt;Logto taught us about parser gaps in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export default&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Directus taught us about the analysis-dump failure mode&lt;/li&gt;
  &lt;li&gt;storybook taught us that the MCP server approach scales to massive monorepos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each lesson fed back into the parser, the delivery mechanism, and the agent prompt. The system got 9x better on F1 (from ~10% to 94.1%) not through model improvements, but through better context engineering.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-bigger-picture-graphs-as-factory-primitives&quot;&gt;The Bigger Picture: Graphs as Factory Primitives&lt;/h2&gt;

&lt;p&gt;The industry is moving toward software factories: automated pipelines where agents write, review, test, and deploy code with increasing autonomy. These factories need infrastructure primitives. The LLM is becoming a commodity. What isn’t a commodity is the structural understanding of what agents are working on.&lt;/p&gt;

&lt;p&gt;Dead code detection is one application. But the underlying primitive, a structured graph of code relationships, enables an entire category of tools:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Impact analysis&lt;/strong&gt;: “If I change this function, what breaks?” (call graph)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Architecture documentation&lt;/strong&gt;: “What are the domains and boundaries in this system?” (domain graph)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Dependency auditing&lt;/strong&gt;: “Which packages are actually used?” (dependency graph)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Refactoring assistance&lt;/strong&gt;: “Show me all the callers of this deprecated API” (call graph)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Security surface mapping&lt;/strong&gt;: “What code paths lead from user input to database queries?” (call graph + data flow)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these has the same structure: pre-compute the graph, rank candidates, let agents handle judgment. The graph is the primitive. The applications are built on top.&lt;/p&gt;

&lt;p&gt;Every team building agent-powered workflows, whether it’s code review, documentation generation, CI pipelines, or full factory orchestration, needs this structural awareness. Right now, most of them are building it from scratch. We think there should be one well-maintained set of graph primitives that everyone can build on, rather than dozens of teams independently duplicating the same foundational work.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-benchmarking-journey-what-we-got-wrong-along-the-way&quot;&gt;The Benchmarking Journey: What We Got Wrong Along the Way&lt;/h2&gt;

&lt;p&gt;Building the dead code tool was one thing. Benchmarking it honestly was harder. Here’s what we learned the hard way.&lt;/p&gt;

&lt;h3 id=&quot;the-evolution-in-numbers&quot;&gt;The evolution in numbers&lt;/h3&gt;

&lt;p&gt;Our benchmark results improved dramatically over three months of iteration:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Period&lt;/th&gt;
      &lt;th&gt;Avg F1&lt;/th&gt;
      &lt;th&gt;Avg Precision&lt;/th&gt;
      &lt;th&gt;Avg Recall&lt;/th&gt;
      &lt;th&gt;Tasks&lt;/th&gt;
      &lt;th&gt;Key Change&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Feb 2026&lt;/td&gt;
      &lt;td&gt;~6%&lt;/td&gt;
      &lt;td&gt;~3%&lt;/td&gt;
      &lt;td&gt;~85%&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;Initial graph analysis dump&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mar 9, 2026&lt;/td&gt;
      &lt;td&gt;~10%&lt;/td&gt;
      &lt;td&gt;~6%&lt;/td&gt;
      &lt;td&gt;~97%&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;Parser improvements&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mar 30, 2026&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;94.1%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;90%&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;MCP server + prompt fixes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The jump from 10% to 94% F1 didn’t come from a better model. It came from three system-level changes: parser improvements that reduced false candidates, the MCP server that eliminated the file size wall, and prompt changes that stopped the agent from second-guessing the graph analysis.&lt;/p&gt;

&lt;h3 id=&quot;measuring-the-wrong-thing&quot;&gt;Measuring the wrong thing&lt;/h3&gt;

&lt;p&gt;Our initial benchmark prompt told the agent to read the analysis file, then “verify” each candidate by grepping the codebase to see if the symbol appeared in other files. This seemed rigorous. The agent would filter false positives before reporting.&lt;/p&gt;

&lt;p&gt;It backfired. On our best-performing task (tyr_pr258), recall dropped from 95.5% to 40%. The agent’s grep verification was &lt;em&gt;less accurate&lt;/em&gt; than the graph analysis it was checking. A function named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hasRole&lt;/code&gt; would match the word “hasRole” in a comment, a string literal, or an unrelated variable. The agent would incorrectly mark it as alive.&lt;/p&gt;

&lt;p&gt;The lesson: &lt;strong&gt;don’t verify a precise tool with a less precise tool.&lt;/strong&gt; Graph-based reachability is strictly more accurate than text search for determining if code is reachable.&lt;/p&gt;

&lt;h3 id=&quot;two-layers-of-invisible-caching&quot;&gt;Two layers of invisible caching&lt;/h3&gt;

&lt;p&gt;After implementing parser improvements (barrel re-export filtering, 7 new pipeline phases, class rescue patterns), we ran the benchmark expecting dramatic improvement. The numbers were identical to the previous run.&lt;/p&gt;

&lt;p&gt;It took investigation to discover why: the benchmark had two layers of result caching. A local file cache keyed on the zip hash short-circuited the API call entirely. Even when we busted through that, the API’s server-side idempotency cache returned the old parser’s results because the input hadn’t changed (same repo, same commit, same zip).&lt;/p&gt;

&lt;p&gt;We had to clear the local cache AND change the idempotency key to actually measure the improved parser. Without this, we would have published results that showed “no improvement” when the improvements were real but unmeasured.&lt;/p&gt;
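
&lt;p&gt;In practice the fix looked roughly like this. The local cache path is specific to our benchmark harness and shown purely as an illustration; the important part is the changed idempotency key:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# 1. Bust the local result cache, which was keyed on the zip hash.
rm -rf .cache/analysis-results/        # illustrative path

# 2. Bust the server-side idempotency cache by changing the key: same repo,
#    same commit, same zip, so fold in a token for the thing that changed.
curl -X POST &quot;https://api.supermodeltools.com/v1/analysis/dead-code&quot; \
  -H &quot;X-Api-Key: $SUPERMODEL_API_KEY&quot; \
  -H &quot;Idempotency-Key: deadcode:$(git rev-parse --short HEAD):parser-v2&quot; \
  -F &quot;file=@/tmp/repo.zip&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;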

&lt;h3 id=&quot;what-honest-benchmarking-looks-like&quot;&gt;What honest benchmarking looks like&lt;/h3&gt;

&lt;p&gt;These mistakes taught us that benchmark infrastructure has as many failure modes as the system being benchmarked. Our checklist now includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Cache invalidation&lt;/strong&gt;: Clear all analysis caches when the parser changes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prompt isolation&lt;/strong&gt;: The benchmark prompt must not introduce behaviors (like grep verification) that interact with what we’re measuring&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Agent behavior logging&lt;/strong&gt;: Always inspect the agent’s transcript, not just the final numbers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;A/B discipline&lt;/strong&gt;: Change one variable at a time (parser version, prompt, agent model) or you can’t attribute results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of our benchmark data, including the runs where we got it wrong, is available in our &lt;a href=&quot;https://github.com/supermodeltools/dead-code-benchmark-blog&quot;&gt;benchmark repository&lt;/a&gt;. Transparency about methodology matters more than impressive numbers.&lt;/p&gt;

&lt;h3 id=&quot;scream-tests-validating-beyond-pr-ground-truth&quot;&gt;Scream tests: validating beyond PR ground truth&lt;/h3&gt;

&lt;p&gt;One thing to note about our benchmarks: our ground truth only captures dead code that a human developer explicitly removed in a PR. In a multi-million line project, there could be lots of dead code that a targeted PR missed. In earlier benchmarks with low precision, some of our “false positives” may have been genuinely dead code that the human developer didn’t catch.&lt;/p&gt;

&lt;p&gt;We’ve begun performing “scream test verification”: systematically deleting reported dead code candidates, then running the project’s build and CI suite. If everything still passes after the deletion, the candidate was genuinely dead, regardless of whether a human ever flagged it. Early scream test results are consistent with our current precision numbers and have surfaced dead code that humans missed.&lt;/p&gt;
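
&lt;p&gt;A single scream test is simple to run by hand. A minimal sketch, assuming an npm-based project and a hypothetical candidate path from the report:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Delete one reported candidate (hypothetical path) and see if anything screams.
git rm src/utils/unusedHelper.ts

# Run whatever the project CI runs: typecheck, build, tests.
npm run build &amp;&amp; npm test

# If everything still passes, the candidate really was dead, whether or not a
# human PR ever flagged it. Either way, restore the file afterwards.
git checkout HEAD -- src/utils/unusedHelper.ts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;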

&lt;hr /&gt;

&lt;h2 id=&quot;try-it&quot;&gt;Try It&lt;/h2&gt;

&lt;p&gt;The Supermodel API is available today. Generate a dead code analysis for your codebase:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Create a repo archive&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; /path/to/repo
git archive &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /tmp/repo.zip HEAD

&lt;span class=&quot;c&quot;&gt;# Analyze (via the dead code endpoint)&lt;/span&gt;
curl &lt;span class=&quot;nt&quot;&gt;-X&lt;/span&gt; POST &lt;span class=&quot;s2&quot;&gt;&quot;https://api.supermodeltools.com/v1/analysis/dead-code&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;-H&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;X-Api-Key: &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$SUPERMODEL_API_KEY&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;-H&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Idempotency-Key: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;git rev-parse &lt;span class=&quot;nt&quot;&gt;--short&lt;/span&gt; HEAD&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;-F&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;file=@/tmp/repo.zip&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Or use the &lt;a href=&quot;https://github.com/supermodeltools/mcp&quot;&gt;Supermodel MCP server&lt;/a&gt; to give your AI agent direct access to graph analysis in real time.&lt;/p&gt;
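
&lt;p&gt;If you want the graphs wired straight into an agent, registering the server with Claude Code looks roughly like this. The package name and invocation are illustrative; check the MCP repo’s README for the real install command and for how the server picks up your API key:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Register the Supermodel MCP server with Claude Code (stdio transport).
# The npm package name below is a placeholder; the server also needs your
# SUPERMODEL_API_KEY (see the repo README for the supported mechanism).
claude mcp add supermodel -- npx -y @supermodeltools/mcp

# Inside a Claude Code session, the agent can now call the graph and
# dead code tools directly instead of grepping the repo.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;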

&lt;p&gt;The graph endpoints (call graph, dependency graph, domain graph, parse graph) are all available through the same API. Our focus is on maintaining precise graphs and parsing so that you don’t have to. If you’re building agent workflows, code review tools, documentation generators, CI pipelines, or factory orchestration, anything that needs structural understanding of a codebase, we want to be the graph layer you build on top of.&lt;/p&gt;

&lt;p&gt;If you have your own ideas about how this problem, or a different one, could be solved better with graph primitives, we’re happy to hand you the raw materials.&lt;/p&gt;

&lt;p&gt;We maintain the graphs. You build the tools.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;methodology-notes&quot;&gt;Methodology Notes&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Benchmark framework&lt;/strong&gt;: &lt;a href=&quot;https://github.com/greynewell/mcpbr&quot;&gt;mcpbr&lt;/a&gt; by &lt;a href=&quot;https://github.com/greynewell&quot;&gt;Grey Newell&lt;/a&gt;, with Claude Code harness&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Model&lt;/strong&gt;: Claude Opus 4.6 (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude-opus-4-6-20260330&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Agent harness&lt;/strong&gt;: Claude Code&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Total benchmark runs&lt;/strong&gt;: 60+ (Feb 6 - Mar 30, 2026)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Total cost&lt;/strong&gt;: ~$220 across all runs (dominated by baseline agent runs)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Repositories tested&lt;/strong&gt;: 14 open-source projects (449 - 138K GitHub stars)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Ground truth sources&lt;/strong&gt;: Merged PRs with passing CI from real open-source projects&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;All runs logged&lt;/strong&gt; with timestamps, configs, full agent transcripts, and structured metrics&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Analysis engine&lt;/strong&gt;: Supermodel MCP server (tree-sitter-based parsing, BFS reachability analysis)&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Supermodel maintains precise code graph primitives so you don’t have to. &lt;a href=&quot;https://docs.supermodeltools.com&quot;&gt;Get started with the API&lt;/a&gt; or &lt;a href=&quot;https://github.com/supermodeltools/mcp&quot;&gt;try the MCP server&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content>
 </entry>
 
 
</feed>