This document gives the exact commands to reproduce every benchmark number
shown in the README and the diagrams/. Two people running the recipe below
on different machines on different days should produce identical numbers,
within float rounding.
If you get different numbers, that's a bug — please file an issue.
The CLI's Token Savings panel uses a chars / 4 approximation labelled
estimated: true, not a model-specific tokenizer. The approximation is
designed to be both fast (no model load, no inference) and conservative.
To check the estimate against a real tokenizer, install tiktoken and pass --verify:

pip install tiktoken
code-review-graph detect-changes --brief --verify

The panel grows a Verified (tiktoken) row showing the same calculation
done with OpenAI's cl100k_base tokenizer (the GPT-4 family). If the
estimate is significantly off, you'll see it immediately:
┌───────────────────────── Token Savings ─────────────────────────┐
│ Full context would be: 12,921 tokens │
│ Graph context used: 762 tokens │
│ Saved: 12,159 tokens (~94%) │
│ Verified (tiktoken): 10,835 tokens (~93%) [11,611 → 776] │
│ Breakdown: Functions 244 · Tests 191 · Risk 244 · Other 83 │
└─────────────────────────────────────────────────────────────────┘
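The arithmetic behind the panel can be sketched as follows. This is a minimal illustration, not the CLI's actual code: `estimate_tokens`, `count_tokens_tiktoken`, and `savings` are hypothetical names, and the floor rounding in the chars / 4 estimate is an assumption.

```python
def estimate_tokens(text: str) -> int:
    """chars / 4 approximation. Floor rounding is an assumption."""
    return len(text) // 4


def count_tokens_tiktoken(text: str) -> int:
    """Count tokens with cl100k_base. Requires `pip install tiktoken`."""
    import tiktoken  # imported lazily so the estimate works without it

    enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() treats strings like "<|endoftext|>" as plain text
    # instead of raising, which matters when tokenizing arbitrary source files.
    return len(enc.encode(text, disallowed_special=()))


def savings(full_tokens: int, graph_tokens: int) -> tuple[int, int]:
    """Return (tokens saved, percent saved rounded to the nearest integer)."""
    saved = full_tokens - graph_tokens
    percent = round(100 * saved / full_tokens) if full_tokens else 0
    return saved, percent


# Numbers copied from the panel above.
print(savings(12_921, 762))  # estimated row
print(savings(11_611, 776))  # verified (tiktoken) row
```

Both rows land within one percentage point of each other, which is the property the Verified row is meant to make visible.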
A one-time calibration across 222 files / 2.2 MB of mixed source (Python, JS, TS, Go, Rust, RST, MD) pulled from the 6 test repos:
| Repo | sample files | bytes | chars/4 estimate | tiktoken real | ratio est/real |
|---|---|---|---|---|---|
| flask | 46 | 470,179 | 117,559 | 109,969 | 1.069 |
| fastapi | 38 | 156,224 | 39,072 | 34,897 | 1.120 |
| gin | 30 | 471,793 | 117,962 | 132,296 | 0.892 |
| express | 23 | 296,805 | 74,207 | 83,575 | 0.888 |
| httpx | 38 | 254,184 | 63,556 | 62,909 | 1.010 |
| code-review-graph | 47 | 539,206 | 134,820 | 120,760 | 1.116 |
| OVERALL | 222 | 2,188,391 | 547,176 | 544,406 | 1.005 |
chars / 4 is within +0.5% of real GPT-4 tokens in aggregate. Per repo
it swings between about -11% (gin and express; gin has lots of short Go
identifiers) and +12% (fastapi: heavy docstrings and type hints). The
savings percentage is more stable than either count, because the
full-context and graph-context figures are estimated the same way and the
bias largely cancels in the division.
Reproduce the calibration with the snippet in this commit's
code_review_graph/context_savings.py:verify_with_tiktoken, or
inline-run the --verify flag on any commit.
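As a quick sanity check on the table, the ratio column can be recomputed from the two token columns. The sketch below only re-derives numbers already shown above; the dictionary name `CALIBRATION` is illustrative.

```python
# (chars/4 estimate, tiktoken count) per repo, copied from the calibration table.
CALIBRATION = {
    "flask": (117_559, 109_969),
    "fastapi": (39_072, 34_897),
    "gin": (117_962, 132_296),
    "express": (74_207, 83_575),
    "httpx": (63_556, 62_909),
    "code-review-graph": (134_820, 120_760),
}

# Per-repo ratio est/real, rounded to three decimals like the table.
per_repo = {name: round(est / real, 3) for name, (est, real) in CALIBRATION.items()}

# The OVERALL row is the ratio of the column sums, not the mean of the ratios.
total_est = sum(est for est, _ in CALIBRATION.values())
total_real = sum(real for _, real in CALIBRATION.values())
overall = round(total_est / total_real, 3)

for name, ratio in per_repo.items():
    print(f"{name:<20}{ratio:.3f}  ({(ratio - 1) * 100:+.1f}%)")
print(f"{'OVERALL':<20}{overall:.3f}")
```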
| Reproducible | Reason |
|---|---|
| Tree-sitter parsing | Pure function of input bytes |
| Node / edge counts | Deterministic upserts keyed by qualified_name |
| FTS5 BM25 scores | Deterministic |
| Embeddings via all-MiniLM-L6-v2 on CPU | Model weights cache-pinned by SHA in HuggingFace cache |
| Leiden community IDs | Seeded: _LEIDEN_SEED=42 in communities.py; override with the CRG_LEIDEN_SEED env var |
| naive_corpus_tokens | Deterministic for a fixed git checkout |
| git clone at a pinned SHA | Determines the source-of-truth byte stream |
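The seed override in the Leiden row can be pictured with a small sketch. It assumes the environment variable simply replaces the default when it parses as an integer; `resolve_leiden_seed` is a hypothetical helper, not the function in communities.py.

```python
import os
from typing import Mapping, Optional

DEFAULT_LEIDEN_SEED = 42  # mirrors the _LEIDEN_SEED=42 default described above


def resolve_leiden_seed(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the seed from CRG_LEIDEN_SEED, or the default if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CRG_LEIDEN_SEED")
    if raw is None or raw.strip() == "":
        return DEFAULT_LEIDEN_SEED
    try:
        return int(raw)
    except ValueError as exc:
        # Fail loudly: a silently ignored seed would break reproducibility.
        raise ValueError(f"CRG_LEIDEN_SEED must be an integer, got {raw!r}") from exc


print(resolve_leiden_seed({}))
print(resolve_leiden_seed({"CRG_LEIDEN_SEED": "7"}))
```

Rejecting a malformed value, rather than falling back to 42, keeps two collaborators from unknowingly running with different seeds.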
What used to make it non-reproducible (now fixed):
- `commit: HEAD` in every `eval/configs/*.yaml`. Replaced with the pinned latest test-commit SHA per repo.
- `git clone --depth 50` silently fell back to wrong commits when the pinned SHAs were beyond the shallow window. The eval now uses full clones with explicit `returncode` checks.
- Leiden ran with an unseeded RNG. It is now seeded.
- `nextjs.yaml` was a misnamed config evaluating this repo. Renamed to `code-review-graph.yaml`.
- FTS5 was created but never populated by the eval framework's `full_build` call. `eval/runner.py` now calls `postprocessing.run_post_processing` directly.
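The "full clone with explicit returncode checks" fix can be sketched roughly as below. This is an illustration under assumptions, not the code in eval/runner.py: `run_checked` and `clone_at_sha` are hypothetical names.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_checked(cmd: Sequence[str], cwd: Optional[str] = None) -> str:
    """Run a command and raise if it exits non-zero, instead of carrying on."""
    proc = subprocess.run(list(cmd), cwd=cwd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{' '.join(cmd)} failed with exit code {proc.returncode}: "
            f"{proc.stderr.strip()}"
        )
    return proc.stdout


def clone_at_sha(url: str, dest: str, sha: str) -> None:
    """Full clone (no --depth), check out the pinned SHA, then verify HEAD."""
    run_checked(["git", "clone", url, dest])
    run_checked(["git", "checkout", sha], cwd=dest)
    head = run_checked(["git", "rev-parse", "HEAD"], cwd=dest).strip()
    if not head.startswith(sha):
        raise RuntimeError(f"{dest}: HEAD is {head}, expected {sha}")


# Demonstrate the check with a harmless command rather than a real clone.
print(run_checked([sys.executable, "-c", "print('ok')"]).strip())
```

Verifying HEAD after the checkout guards against the earlier failure mode, where a command appeared to succeed but left the working tree on the wrong commit.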
- Python 3.10 or newer
- `git` on PATH
- Network access (~600 MB to clone the 6 upstream repos)
- ~3 GB free disk
- For the embedding step: roughly 700 MB extra for `torch` + `sentence-transformers`
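If you want to check these prerequisites before starting, a preflight along the following lines may help. It is a hypothetical helper, not part of the repo, and it does not test network access.

```python
import shutil
import sys


def preflight(path: str = ".", min_free_gb: float = 3.0,
              min_python: tuple = (3, 10)) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or newer is required")
    if shutil.which("git") is None:
        problems.append("git not found on PATH")
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, need about {min_free_gb:.0f} GB")
    return problems


for problem in preflight():
    print("preflight:", problem)
```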
git clone https://github.com/tirth8205/code-review-graph
cd code-review-graph
# eval extras: pyyaml + matplotlib (matplotlib only needed for `--report`)
# embeddings extras: sentence-transformers + numpy
uv sync --extra eval --extra embeddings # or: pip install -e ".[eval,embeddings]"

The eval run clones 6 upstream repositories at pinned SHAs, builds a full graph
for each (parser + cross-file resolvers + signatures + FTS5 + flows + Leiden
communities), then runs the token_efficiency, impact_accuracy, and
multi_hop_retrieval benchmarks.
uv run code-review-graph eval --benchmark token_efficiency,impact_accuracy,multi_hop_retrieval

Expected runtime on an M1/M2 Mac: roughly 8–15 minutes for the build phase, plus seconds per benchmark.
Outputs:
- `evaluate/test_repos/{express,fastapi,flask,gin,httpx,code-review-graph}/`
- `evaluate/test_repos/<name>/.code-review-graph/graph.db`
- `evaluate/results/<name>_<benchmark>_<date>.csv`
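Because repo names contain hyphens and benchmark names contain underscores, splitting a result filename on `_` alone is ambiguous. A small sketch that keys on the known benchmark names instead is shown below; `parse_result_name` is a hypothetical helper, and the date string in the example is an assumed format, since the source only says `<date>`.

```python
from pathlib import Path
from typing import Optional, Tuple

# Benchmark names taken from the eval command above.
BENCHMARKS = ("token_efficiency", "impact_accuracy", "multi_hop_retrieval")


def parse_result_name(filename: str) -> Optional[Tuple[str, str, str]]:
    """Split '<name>_<benchmark>_<date>.csv' into (name, benchmark, date)."""
    stem = Path(filename).stem
    for bench in BENCHMARKS:
        marker = f"_{bench}_"
        if marker in stem:
            repo, date = stem.split(marker, 1)
            return repo, bench, date
    return None  # not a recognised result file


# The date format here is an assumption for illustration only.
print(parse_result_name("code-review-graph_token_efficiency_2026-05-25.csv"))
```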
The standalone token benchmark ships with 5 hardcoded natural-language questions. Without embeddings, hybrid search can't match them, and the benchmark returns 0× reduction ratios instead of failing (it does print a loud warning). Generate embeddings for each repo first:
for repo in express fastapi flask gin httpx code-review-graph; do
uv run code-review-graph embed --repo "evaluate/test_repos/$repo"
done

Expected runtime: 2–5 minutes total. Vectors live inside the same graph.db.
This benchmark compares all source-file tokens in the repo against 5 search hits + a few neighbor edges for each of 5 sample questions. The ratio answers: how many tokens does the graph let me skip on a typical question?
uv run python <<'PY'
import json
from pathlib import Path

from code_review_graph.graph import GraphStore
from code_review_graph.token_benchmark import run_token_benchmark

# Run the standalone benchmark against every repo that has a built graph.
results = {}
for repo in sorted(Path("evaluate/test_repos").iterdir()):
    db = repo / ".code-review-graph" / "graph.db"
    if not db.exists():
        continue
    store = GraphStore(str(db))
    try:
        results[repo.name] = run_token_benchmark(store, repo)
    finally:
        store.close()

# Print a summary table, highest average reduction ratio first.
print(f"{'Repo':<22}{'naive_tokens':>16}{'avg_graph_tokens':>20}{'avg_ratio':>14}")
print("-" * 72)
for name, out in sorted(results.items(), key=lambda x: -x[1]["average_reduction_ratio"]):
    pq = out["per_question"]
    avg_graph = int(sum(r["graph_tokens"] for r in pq) / max(len(pq), 1))
    print(f"{name:<22}{out['naive_corpus_tokens']:>16,}"
          f"{avg_graph:>20,}{out['average_reduction_ratio']:>13.1f}×")

# Keep the raw per-question results for later comparison.
Path("evaluate/standalone_token_benchmark.json").write_text(json.dumps(results, indent=2))
PY

Captured 2026-05-25 on macOS arm64, Python 3.11, sentence-transformers 5.5.1,
all-MiniLM-L6-v2, CRG_LEIDEN_SEED=42. If your numbers differ by more than
rounding, something in the chain has drifted; please file an issue.
Each row is the average of 5 sample questions (how does authentication work,
what is the main entry point, how are database connections managed,
what error handling patterns are used, how do tests verify core functionality).
| Repo | snapshot SHA | naive_corpus_tokens | avg graph_tokens | avg ratio |
|---|---|---|---|---|
| fastapi | 0227991a | 951,071 | 2,169 | 528.4× |
| code-review-graph | 84bde354 | 208,821 | 2,495 | 93.0× |
| gin | 5c00df8a | 166,868 | 1,990 | 91.8× |
| flask | a29f88ce | 125,022 | 1,986 | 71.4× |
| express | b4ab7d65 | 135,955 | 3,465 | 40.6× |
| httpx | b55d4635 | 89,492 | 2,438 | 38.0× |
Range across 6 repos: 38× – 528×. The numbers shifted down from a
previous capture because (a) the test repos are now wiped/re-cloned from
scratch — no leftover build artifacts or local caches inflate the naive
baseline; and (b) the embedding text per node became richer in this same
release (see embeddings._node_to_text), so the graph response itself is
slightly bigger. Both are correctness improvements over the prior numbers.
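Note that the avg ratio column is not naive_corpus_tokens divided by avg graph_tokens: for fastapi, 951,071 / 2,169 is about 438.5×, not 528.4×. One explanation, which is an assumption here since the source does not spell out the aggregation, is that the column averages the per-question ratios. The sketch below uses hypothetical per-question numbers to show how the two aggregations diverge.

```python
from statistics import mean


def reduction_ratios(naive_tokens: int, graph_tokens_per_question: list) -> list:
    """Per-question ratio naive / graph, skipping questions with no graph tokens."""
    return [naive_tokens / g for g in graph_tokens_per_question if g > 0]


# Hypothetical numbers for one repo, chosen only to illustrate the effect.
naive = 100_000
per_question = [500, 1_000, 2_000, 2_500, 4_000]

ratios = reduction_ratios(naive, per_question)
mean_of_ratios = mean(ratios)               # average of the five ratios
ratio_of_means = naive / mean(per_question)  # naive / average graph_tokens

print(f"mean of ratios: {mean_of_ratios:.1f}x")
print(f"ratio of means: {ratio_of_means:.1f}x")
```

Small responses dominate a mean of ratios, so one very cheap question can lift the average well above what the average response size alone would suggest.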
A different denominator: just the changed-file content for each commit,
vs the full get_review_context() JSON. For small commits the response is
larger than the input (it carries impact-radius edges + source snippets), so
ratios here are intentionally < 1.0 — that is not a bug, it measures a
different thing than the standalone benchmark.
Raw per-commit CSVs in evaluate/results/<repo>_token_efficiency_*.csv.
13 commits across 6 repos.
| Metric | Value |
|---|---|
| Recall (mean across 13 commits) | 1.000 (100% on every commit) |
| F1 (mean) | 0.714 |
| F1 (median) | 0.667 |
| F1 (min / max) | 0.455 / 1.000 |
The blast-radius analysis over-predicts in some commits (precision ≈ 0.30 in the worst case, where 34 files are flagged for a 10-file change). That is intentional: a missed dependency is worse than an extra reviewed file.
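The worst case quoted above can be worked through directly. The sketch assumes the ground truth is the 10 changed files and that all of them are among the 34 flagged (which is what recall 1.000 implies); `prf1` is an illustrative helper.

```python
def prf1(true_files: int, flagged_files: int, true_flagged: int) -> tuple:
    """Precision, recall, and F1 from simple file counts."""
    precision = true_flagged / flagged_files if flagged_files else 0.0
    recall = true_flagged / true_files if true_files else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 34 files flagged for a 10-file change, with every changed file caught.
p, r, f1 = prf1(true_files=10, flagged_files=34, true_flagged=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under these assumptions precision is 10/34 and F1 is 20/44, which lines up with the ≈ 0.30 precision quoted above and the 0.455 minimum F1 in the table.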
11 hand-curated tasks across the 6 repos. Each task is a 2-step tool chain:
1. `hybrid_search(nl_query, limit=10)` looks for a starting anchor node.
2. `query_graph(<traversal_pattern>, target=<anchor>)` walks one hop along `callers_of` / `callees_of` / `tests_for` / `imports_of` / etc.
The task scores 1.0 only if both the anchor is found in the top-K and
the expected neighbor names are returned by the traversal. Scores 0.0
otherwise (which collapses both "search missed the anchor" and "traversal
returned the wrong set" — split those by inspecting anchor_found and
neighbor_recall in the per-task CSV row).
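The all-or-nothing scoring rule can be sketched as follows. This is a minimal illustration rather than the benchmark's code: the function names are hypothetical, ranks are assumed to be 0-based, the anchor match is assumed to be a case-insensitive suffix test on the qualified name, and "the expected neighbor names are returned" is read as "all of them".

```python
from typing import Iterable, List, Optional


def anchor_rank(hits: List[str], suffix: str, k: int = 10) -> Optional[int]:
    """0-based rank of the first top-K hit whose qualified name ends with suffix."""
    suffix = suffix.lower()
    for rank, qualified_name in enumerate(hits[:k]):
        if qualified_name.lower().endswith(suffix):
            return rank
    return None


def neighbor_recall(expected: Iterable[str], returned: Iterable[str]) -> float:
    """Fraction of expected neighbor names present in the traversal result."""
    expected_set = set(expected)
    if not expected_set:
        return 0.0
    return len(expected_set & set(returned)) / len(expected_set)


def score_task(rank: Optional[int], recall: float) -> float:
    """1.0 only if the anchor was found AND every expected neighbor came back."""
    return 1.0 if rank is not None and recall == 1.0 else 0.0


# Hypothetical search hits and traversal output.
hits = ["gin.go::Engine.Run", "context.go::Context.Next"]
rank = anchor_rank(hits, "context.go::context.next")
recall = neighbor_recall(["ServeHTTP"], ["ServeHTTP", "handleHTTPRequest"])
print(rank, recall, score_task(rank, recall))
```

Keeping `rank` and `recall` as separate values is what lets a 0.0 score be attributed to either the search step or the traversal step.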
| Repo | Task | Anchor found | Rank | Neighbor recall | Score |
|---|---|---|---|---|---|
| code-review-graph | crg-parse-file-callers | yes | 0 | 1.00 | 1.00 |
| code-review-graph | crg-upsert-node-callers | yes | 4 | 1.00 | 1.00 |
| express | express-create-application-callees | yes | 1 | 1.00 | 1.00 |
| fastapi | fastapi-route-handler-callers | yes | 6 | 1.00 | 1.00 |
| fastapi | fastapi-get-dependant-callers | no | — | 0.00 | 0.00 |
| flask | flask-dispatch-callers | yes | 3 | 1.00 | 1.00 |
| flask | flask-exception-callers | yes | 5 | 1.00 | 1.00 |
| gin | gin-serve-http-callees | yes | 5 | 1.00 | 1.00 |
| gin | gin-context-next-callers | yes | 0 | 1.00 | 1.00 |
| httpx | httpx-client-request-callers | yes | 0 | 1.00 | 1.00 |
| httpx | httpx-async-request-tests | yes | 7 | 1.00 | 1.00 |
Average score across 11 tasks: 0.909. 10/11 tasks pass; the one remaining
miss (fastapi-get-dependant-callers) targets a function spelled get_dependant
("dependant" with an a) from a query phrased as "dependency declarations into
a tree" — there is no lexical overlap and no extractable identifier in the
query for the boosting heuristic to lock onto. Left as an honest miss; the
fix would be either query rewriting or a richer embedding model.
The v1 scaffold first scored 0.545 (6/11). Two changes brought it to 0.909 (10/11), both deterministic, both small, both committed in this same session:
- `embeddings.py:_node_to_text`: the embedded text per node used to be just `"{name} {kind} in {parent}"`. It now also includes the dotted form (`APIRoute.get_route_handler`), the identifier split into words (`get route handler`), and the enclosing module directory (`routing`, `fastapi`, `dependencies`). Re-embedding is automatic: the text hash changes, so `EmbeddingStore.embed_nodes` re-embeds. See `_split_identifier` for the casing/separator rules.
- `search.py:extract_query_identifiers`: natural-language queries like "Who advances the gin middleware chain via Context.Next" now have their dotted / snake_case / CamelCase identifier tokens extracted. Search results whose `qualified_name` contains any extracted identifier get a 2.0× boost. This pushed `Context.Next` from rank 11 to rank 0.
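To make the first change concrete, here is a rough sketch of how a richer node text could be assembled. It is not the repo's `_node_to_text` or `_split_identifier`: the function names, the splitting regex, and the field order are all assumptions.

```python
import re
from typing import List

# Splits on case changes, underscores, and digits; acronyms stay together.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(name: str) -> str:
    """'get_route_handler' -> 'get route handler', 'APIRoute' -> 'api route'."""
    return " ".join(word.lower() for word in _WORD.findall(name))


def node_to_text(name: str, kind: str, parent: str, module_dirs: List[str]) -> str:
    """Old text plus dotted form, split words, and enclosing directories."""
    parts = [f"{name} {kind} in {parent}"]   # the original, shorter text
    if parent:
        parts.append(f"{parent}.{name}")      # dotted form
    parts.append(split_identifier(name))      # identifier as plain words
    parts.extend(module_dirs)                 # enclosing module directories
    return " ".join(parts)


print(node_to_text("get_route_handler", "Function", "APIRoute", ["routing", "fastapi"]))
```

The split words give the embedding model ordinary vocabulary to match against natural-language queries, which a raw snake_case token does not.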
The remaining fastapi-get-dependant-callers failure cannot be fixed by
either change because the query doesn't share any identifier or substring
with the target — that's the boundary of the heuristic.
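The boosting heuristic and its boundary can be illustrated with a short sketch. The regexes, function names, and sample scores below are assumptions for illustration; only the 2.0× factor and the two example queries come from the text above.

```python
import re
from typing import List, Tuple

_IDENT_PATTERNS = (
    re.compile(r"\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+\b"),    # dotted: Context.Next
    re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b"),      # snake_case
    re.compile(r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b"),  # CamelCase
)


def extract_identifiers_sketch(query: str) -> List[str]:
    """Collect identifier-looking tokens from a natural-language query."""
    found: List[str] = []
    for pattern in _IDENT_PATTERNS:
        for match in pattern.findall(query):
            if match not in found:
                found.append(match)
    return found


def boost(results: List[Tuple[str, float]], identifiers: List[str],
          factor: float = 2.0) -> List[Tuple[str, float]]:
    """Multiply the score of hits whose qualified name contains an identifier."""
    lowered = [ident.lower() for ident in identifiers]
    rescored = [
        (qn, score * factor if any(i in qn.lower() for i in lowered) else score)
        for qn, score in results
    ]
    return sorted(rescored, key=lambda item: -item[1])


ids = extract_identifiers_sketch("Who advances the gin middleware chain via Context.Next")
hits = [("gin.go::Engine.Run", 0.9), ("context.go::Context.Next", 0.5)]
print(ids, boost(hits, ids)[0][0])

# The failing query has nothing identifier-shaped, so nothing can be boosted.
print(extract_identifiers_sketch("dependency declarations into a tree"))
```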
This benchmark is a v1 scaffold (11 tasks). The intent is to track the
multi-hop tool chain as the agent's actual usage pattern rather than just
single-shot retrieval. Adding more tasks: append multi_hop_tasks: entries
to any config under code_review_graph/eval/configs/*.yaml with the schema:
multi_hop_tasks:
  - id: my-task-id                  # required, unique
    nl_query: "natural language"    # required, what an agent would ask
    anchor_qualified_suffix:        # required, lowercased suffix of expected
      "rel/path.py::owner.symbol"   #   qualified_name (case-insensitive endswith)
    traversal_pattern: callers_of   # one of callers_of|callees_of|imports_of|
                                    #   importers_of|tests_for|inheritors_of|children_of
    expected_neighbor_names:        # required, list of bare names that should
      - "expected_one"              #   appear in the traversal result
    k: 10                           # optional, top-K depth for the search step

| Repo | Nodes | Edges | Flows | Communities | Embeddings | FTS idx rows |
|---|---|---|---|---|---|---|
| fastapi | 6,292 | 32,081 | 165 | 85 | 5,164 | 127 |
| express | 1,912 | 18,877 | 4 | 7 | 1,771 | 47 |
| gin | 1,589 | 17,237 | 114 | 41 | 1,491 | 29 |
| code-review-graph | 1,418 | 8,877 | 104 | 11 | 1,326 | 38 |
| flask | 1,415 | 8,259 | 78 | 13 | 1,329 | 35 |
| httpx | 1,261 | 8,228 | 128 | 5 | 1,193 | 34 |
Embeddings count is lower than node count because File nodes aren't embedded. FTS idx rows are far lower than node count because FTS5 stores inverted-index segments, not one row per indexed document.
There are three different "token" benchmarks in the repo. They are all valid but measure different scenarios:
| Benchmark | Naive baseline | Graph cost | Question answered |
|---|---|---|---|
| eval/benchmarks/token_efficiency.py | sum of changed-file content for a specific commit | full get_review_context() JSON | "Is the graph cheaper than just reading the diffed files?" |
| eval/token_benchmark.py | none (absolute per-workflow cost) | sum of 5 MCP-tool responses | "How many tokens does a complete agent workflow cost?" |
| code_review_graph/token_benchmark.py (standalone) | sum of all source files in repo | 5 search hits + 5 neighbor edges per question | "Is the graph cheaper than reading the whole repo?" |
The eval/benchmarks/token_efficiency.py numbers can be less than 1.0×
for small commits (get_review_context carries impact-radius metadata and
source snippets, which outweigh a tiny changed-file set). The standalone
benchmark numbers are always large because the baseline is the entire
repo. Pick the one that matches the scenario you're talking about.
The 9 diagrams in diagrams/ are produced from diagrams/generate_diagrams.py.
Excalidraw source files (.excalidraw) are gitignored (*.excalidraw line in
.gitignore); only the rendered PNGs are tracked. Regenerate after a
benchmark refresh:
uv run python diagrams/generate_diagrams.py
# Open each .excalidraw at https://excalidraw.com to render/export

git clone failed: network trouble or an upstream rate limit. The fix is a
clean retry; the eval doesn't auto-retry by design (loud failures > silent
fallback).

git checkout <sha> failed: upstream rewrote history or removed the
SHA. File an issue with the failing config so we can re-pin.

"No embeddings found in this graph" warning during the standalone
benchmark: you skipped Step 3 (the embed loop over the six repos). Run it.

Different community IDs between runs: make sure you're on the seeded
communities.py. Check with grep _LEIDEN_SEED code_review_graph/communities.py.
You can override the seed via CRG_LEIDEN_SEED=<int>, but all collaborators
must agree on the same value.

Different naive_corpus_tokens than the canonical table: make sure
git rev-parse HEAD inside each evaluate/test_repos/<name> matches the
commit: field in the corresponding config file. If not, delete the clone
and let Step 2 (the eval run) re-clone at the pinned SHA.
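The last check can be automated with a short script. The sketch below is a hypothetical helper: it takes the pinned SHAs as a plain mapping rather than parsing the config files, and it accepts short SHAs by prefix match. The example pins are the snapshot SHAs from the results table, on the assumption that they match the `commit:` fields.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, Mapping


def git_head(repo_dir: Path) -> str:
    """Return the full HEAD SHA of a clone, failing loudly on git errors."""
    proc = subprocess.run(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, capture_output=True, text=True
    )
    if proc.returncode != 0:
        raise RuntimeError(f"git rev-parse failed in {repo_dir}: {proc.stderr.strip()}")
    return proc.stdout.strip()


def find_drifted(pins: Mapping[str, str],
                 root: Path = Path("evaluate/test_repos"),
                 head_of: Callable[[Path], str] = git_head) -> Dict[str, str]:
    """Map repo name -> actual HEAD for every clone not at its pinned SHA."""
    drifted = {}
    for name, pinned in pins.items():
        head = head_of(root / name)
        if not head.startswith(pinned):
            drifted[name] = head
    return drifted


# Pins assumed to match the snapshot SHA column; extend with the other repos.
PINS = {"fastapi": "0227991a", "gin": "5c00df8a"}
# Usage, once the clones exist: print(find_drifted(PINS))
```

Any repo the function reports should be deleted and re-cloned, as described above, before comparing token counts.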