Reproducing the Benchmarks

This document gives the exact commands to reproduce every benchmark number shown in the README and the diagrams/. Two people running the recipe below on different machines on different days should produce identical numbers, within float rounding.

If you get different numbers, that's a bug — please file an issue.

Verifying the "saved tokens" number

The CLI's Token Savings panel uses a chars / 4 approximation labelled estimated: true, not a model-specific tokenizer. The approximation is designed to be both fast (no model load, no inference) and conservative.

How to verify against a real tokenizer

pip install tiktoken
code-review-graph detect-changes --brief --verify

The panel grows a Verified (tiktoken) row showing the same calculation done with OpenAI's cl100k_base tokenizer (the GPT-4 family). If the estimate is significantly off, you'll see it immediately:

┌───────────────────────── Token Savings ─────────────────────────┐
│ Full context would be:     12,921 tokens                        │
│ Graph context used:           762 tokens                        │
│ Saved:                     12,159 tokens (~94%)                 │
│ Verified (tiktoken):       10,835 tokens (~93%)  [11,611 → 776] │
│ Breakdown: Functions 244 · Tests 191 · Risk 244 · Other 83      │
└─────────────────────────────────────────────────────────────────┘
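The percentages in the panel follow directly from the two token counts on each row. A minimal sketch of that arithmetic, using the figures from the sample panel above (the helper name `savings` is illustrative and not part of the CLI):

```python
def savings(full_tokens: int, graph_tokens: int) -> tuple[int, float]:
    """Return (tokens saved, percent saved) for one row of the panel."""
    saved = full_tokens - graph_tokens
    percent = 100.0 * saved / full_tokens if full_tokens else 0.0
    return saved, percent


# Estimated row (chars / 4) and verified row (tiktoken) from the sample panel.
est_saved, est_pct = savings(12_921, 762)
ver_saved, ver_pct = savings(11_611, 776)

print(f"estimated: {est_saved:,} tokens (~{est_pct:.0f}%)")
print(f"verified:  {ver_saved:,} tokens (~{ver_pct:.0f}%)")
```

Both rows land within about one percentage point of each other even though the absolute counts differ by more than a thousand tokens.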

Calibration result (committed)

A one-time calibration across 222 files / 2.2 MB of mixed source (Python, JS, TS, Go, Rust, RST, MD) pulled from the 6 test repos:

| Repo sample | files | bytes | chars/4 estimate | tiktoken real | ratio est/real |
|---|---|---|---|---|---|
| flask | 46 | 470,179 | 117,559 | 109,969 | 1.069 |
| fastapi | 38 | 156,224 | 39,072 | 34,897 | 1.120 |
| gin | 30 | 471,793 | 117,962 | 132,296 | 0.892 |
| express | 23 | 296,805 | 74,207 | 83,575 | 0.888 |
| httpx | 38 | 254,184 | 63,556 | 62,909 | 1.010 |
| code-review-graph | 47 | 539,206 | 134,820 | 120,760 | 1.116 |
| OVERALL | 222 | 2,188,391 | 547,176 | 544,406 | 1.005 |

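chars / 4 is within +0.5% of real GPT-4 tokens in aggregate. Per-repo it swings between -11% (gin and express; for gin, lots of short Go identifiers) and +12% (fastapi: heavy docstrings and type hints). The savings percentage is steadier than either raw count, because the full-context and graph-context figures are estimated the same way and most of the bias cancels when one is divided by the other.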

Reproduce the calibration with verify_with_tiktoken in this commit's code_review_graph/context_savings.py, or run detect-changes with the --verify flag on any commit.
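For a quick independent check of the est/real ratio on your own files, something like the following sketch may help. It is not the repo's verify_with_tiktoken: the function names are hypothetical, and floor division for chars / 4 is an assumption. The real tokenizer is passed in as a callable so that everything except `tiktoken_counter` runs without tiktoken installed.

```python
from typing import Callable, Iterable


def chars_over_four(text: str) -> int:
    """Estimate tokens as chars / 4 (floor division is an assumption here)."""
    return len(text) // 4


def calibration_ratio(texts: Iterable[str], count_real: Callable[[str], int]) -> float:
    """Return sum(estimate) / sum(real) over a sample of texts."""
    est = real = 0
    for text in texts:
        est += chars_over_four(text)
        real += count_real(text)
    return est / real if real else float("nan")


def tiktoken_counter() -> Callable[[str], int]:
    """Build a cl100k_base token counter; requires `pip install tiktoken`."""
    import tiktoken  # imported lazily so the rest of the sketch runs without it

    enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() keeps source text that happens to contain
    # special-token strings from raising an error.
    return lambda text: len(enc.encode(text, disallowed_special=()))
```

Usage would look like `calibration_ratio(file_texts, tiktoken_counter())`, where `file_texts` is a list of file contents you have read yourself. A result near 1.0 means the estimate tracks the tokenizer on that sample.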

What is and isn't deterministic

| Reproducible | Reason |
|---|---|
| Tree-sitter parsing | Pure function of input bytes |
| Node / edge counts | Deterministic upserts keyed by qualified_name |
| FTS5 BM25 scores | Deterministic |
| Embeddings via all-MiniLM-L6-v2 on CPU | Model weights cache-pinned by SHA in HuggingFace cache |
| Leiden community IDs | Seeded: _LEIDEN_SEED=42 in communities.py, override with CRG_LEIDEN_SEED env var |
| naive_corpus_tokens | Deterministic for a fixed git checkout |
| git clone at a pinned SHA | Determines the source-of-truth byte stream |
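The seed override described for Leiden can be pictured roughly as below. This is a sketch of the pattern only; the helper name `resolve_leiden_seed` is hypothetical, and the actual lookup in communities.py may differ.

```python
import os

DEFAULT_LEIDEN_SEED = 42  # mirrors _LEIDEN_SEED=42 from the table above


def resolve_leiden_seed(environ=os.environ) -> int:
    """Return the seed from CRG_LEIDEN_SEED if set, else the default."""
    raw = environ.get("CRG_LEIDEN_SEED")
    if raw is None or raw.strip() == "":
        return DEFAULT_LEIDEN_SEED
    return int(raw)  # raises ValueError on a non-integer value


print(resolve_leiden_seed())
```

Failing loudly on a malformed value, rather than falling back to the default, keeps two collaborators from silently running with different seeds.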

What used to make it non-reproducible (now fixed):

  • commit: HEAD in every eval/configs/*.yaml — replaced with the pinned latest test-commit SHA per repo
  • git clone --depth 50 silently fell back to wrong commits when the pinned SHAs were beyond the shallow window — now uses full clones with explicit returncode checks
  • Leiden ran with an unseeded RNG — now seeded
  • nextjs.yaml was a misnamed config evaluating this repo — renamed to code-review-graph.yaml
  • FTS5 was created but never populated by the eval framework's full_build call — eval/runner.py now calls postprocessing.run_post_processing directly
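The "full clone with explicit returncode checks" fix can be sketched as follows. This is an illustration of the approach, not the eval runner's code; `run_checked` and `clone_at_sha` are hypothetical names.

```python
import subprocess
from typing import Optional


def run_checked(cmd: list[str], cwd: Optional[str] = None) -> str:
    """Run a command and raise, with stderr attached, if it exits non-zero."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{' '.join(cmd)} failed with exit code {proc.returncode}: "
            f"{proc.stderr.strip()}"
        )
    return proc.stdout.strip()


def clone_at_sha(url: str, dest: str, sha: str) -> None:
    """Full clone (no --depth), then check out and confirm the pinned commit."""
    run_checked(["git", "clone", url, dest])
    run_checked(["git", "checkout", "--detach", sha], cwd=dest)
    head = run_checked(["git", "rev-parse", "HEAD"], cwd=dest)
    if not head.startswith(sha.lower()):
        raise RuntimeError(f"expected HEAD {sha}, got {head}")
```

Every git call goes through the same checked wrapper, so a failed checkout stops the run instead of leaving an unrelated commit in place.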

Prerequisites

  • Python 3.10 or newer
  • git on PATH
  • Network access (~600 MB to clone the 6 upstream repos)
  • ~3 GB free disk
  • For the embedding step: roughly 700 MB extra for torch + sentence-transformers

Step 1 — Install with the right extras

git clone https://github.com/tirth8205/code-review-graph
cd code-review-graph

# eval extras: pyyaml + matplotlib (matplotlib only needed for `--report`)
# embeddings extras: sentence-transformers + numpy
uv sync --extra eval --extra embeddings     # or: pip install -e ".[eval,embeddings]"

Step 2 — Run the formal eval

This step clones 6 upstream repositories at pinned SHAs, builds a full graph for each (parser + cross-file resolvers + signatures + FTS5 + flows + Leiden communities), then runs the token_efficiency, impact_accuracy, and multi_hop_retrieval benchmarks.

uv run code-review-graph eval --benchmark token_efficiency,impact_accuracy,multi_hop_retrieval

Expected runtime on an M1/M2 Mac: roughly 8–15 minutes for the build phase, plus seconds per benchmark.

Outputs:

  • evaluate/test_repos/{express,fastapi,flask,gin,httpx,code-review-graph}/
  • evaluate/test_repos/<name>/.code-review-graph/graph.db
  • evaluate/results/<name>_<benchmark>_<date>.csv

Step 3 — Generate embeddings (required for the standalone benchmark)

The standalone token benchmark ships with 5 hardcoded natural-language questions. Without embeddings, hybrid search can't match them and the benchmark returns 0× reduction ratios. It prints a loud warning when this happens.

for repo in express fastapi flask gin httpx code-review-graph; do
  uv run code-review-graph embed --repo "evaluate/test_repos/$repo"
done

Expected runtime: 2–5 minutes total. Vectors live inside the same graph.db.

Step 4 — Run the standalone token benchmark

This benchmark compares all source-file tokens in the repo against 5 search hits + a few neighbor edges for each of 5 sample questions. The ratio answers: how many tokens does the graph let me skip on a typical question?

uv run python <<'PY'
import json
from pathlib import Path
from code_review_graph.graph import GraphStore
from code_review_graph.token_benchmark import run_token_benchmark

results = {}
for repo in sorted(Path("evaluate/test_repos").iterdir()):
    db = repo / ".code-review-graph" / "graph.db"
    if not db.exists():
        continue
    store = GraphStore(str(db))
    try:
        results[repo.name] = run_token_benchmark(store, repo)
    finally:
        store.close()

print(f"{'Repo':<22}{'naive_tokens':>16}{'avg_graph_tokens':>20}{'avg_ratio':>14}")
print("-" * 72)
for name, out in sorted(results.items(), key=lambda x: -x[1]["average_reduction_ratio"]):
    pq = out["per_question"]
    avg_graph = int(sum(r["graph_tokens"] for r in pq) / max(len(pq), 1))
    print(f"{name:<22}{out['naive_corpus_tokens']:>16,}"
          f"{avg_graph:>20,}{out['average_reduction_ratio']:>13.1f}×")

Path("evaluate/standalone_token_benchmark.json").write_text(json.dumps(results, indent=2))
PY

Canonical numbers

Captured 2026-05-25 on macOS arm64, Python 3.11, sentence-transformers 5.5.1, all-MiniLM-L6-v2, CRG_LEIDEN_SEED=42. If your numbers differ by more than rounding, something in the chain has drifted — file an issue.
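To check a local run against the canonical standalone numbers, a small comparison script along these lines may be convenient. It reads the JSON written at the end of Step 4 and uses only the keys that script already accesses; the values in `CANONICAL` are copied from the standalone table in this section, and `find_drift` is a hypothetical helper name.

```python
import json
from pathlib import Path

# (naive_corpus_tokens, avg ratio) copied from the canonical standalone table.
CANONICAL = {
    "fastapi": (951_071, 528.4),
    "code-review-graph": (208_821, 93.0),
    "gin": (166_868, 91.8),
    "flask": (125_022, 71.4),
    "express": (135_955, 40.6),
    "httpx": (89_492, 38.0),
}


def find_drift(results: dict) -> list[str]:
    """Return one message per value that differs from the canonical table."""
    problems = []
    for name, (naive, ratio) in CANONICAL.items():
        got = results.get(name)
        if got is None:
            problems.append(f"{name}: missing from results")
            continue
        if got["naive_corpus_tokens"] != naive:
            problems.append(
                f"{name}: naive_corpus_tokens {got['naive_corpus_tokens']} != {naive}"
            )
        if round(got["average_reduction_ratio"], 1) != ratio:
            problems.append(
                f"{name}: avg ratio {got['average_reduction_ratio']:.1f} != {ratio}"
            )
    return problems


path = Path("evaluate/standalone_token_benchmark.json")
if path.exists():
    for line in find_drift(json.loads(path.read_text())) or ["no drift"]:
        print(line)
```

A naive_corpus_tokens mismatch usually points at the checkout, while a ratio-only mismatch points at search or embeddings.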

Standalone token benchmark (code_review_graph/token_benchmark.py)

Each row is the average of 5 sample questions (how does authentication work, what is the main entry point, how are database connections managed, what error handling patterns are used, how do tests verify core functionality).

| Repo | snapshot SHA | naive_corpus_tokens | avg graph_tokens | avg ratio |
|---|---|---|---|---|
| fastapi | 0227991a | 951,071 | 2,169 | 528.4× |
| code-review-graph | 84bde354 | 208,821 | 2,495 | 93.0× |
| gin | 5c00df8a | 166,868 | 1,990 | 91.8× |
| flask | a29f88ce | 125,022 | 1,986 | 71.4× |
| express | b4ab7d65 | 135,955 | 3,465 | 40.6× |
| httpx | b55d4635 | 89,492 | 2,438 | 38.0× |

Range across 6 repos: 38× – 528×. The numbers shifted down from a previous capture because (a) the test repos are now wiped/re-cloned from scratch — no leftover build artifacts or local caches inflate the naive baseline; and (b) the embedding text per node became richer in this same release (see embeddings._node_to_text), so the graph response itself is slightly bigger. Both are correctness improvements over the prior numbers.
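Note that dividing naive_corpus_tokens by avg graph_tokens does not reproduce the avg ratio column (for fastapi, 951,071 / 2,169 is about 438×, not 528.4×). One reading consistent with every row is that the avg ratio is the mean of the five per-question ratios, which is never smaller than the ratio of the means. That reading is an assumption, and the numbers in the sketch below are hypothetical, chosen only to show the effect:

```python
from statistics import mean

naive_corpus_tokens = 100_000                       # hypothetical repo size
graph_tokens = [1_000, 2_000, 4_000, 2_000, 1_000]  # hypothetical per-question costs

# Ratio of means: one division using the averaged graph cost.
ratio_of_means = naive_corpus_tokens / mean(graph_tokens)

# Mean of ratios: divide per question first, then average.
mean_of_ratios = mean(naive_corpus_tokens / g for g in graph_tokens)

print(f"naive / avg graph_tokens: {ratio_of_means:.1f}x")
print(f"mean per-question ratio:  {mean_of_ratios:.1f}x")
```

Cheap questions pull the mean of ratios up more than expensive ones pull it down, so the gap widens as per-question costs spread out.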

Formal token_efficiency benchmark (eval/benchmarks/token_efficiency.py)

A different denominator: just the changed-file content for each commit, versus the full get_review_context() JSON. For small commits the response is larger than the input (it carries impact-radius edges and source snippets), so ratios below 1.0 are expected here. That is not a bug; this benchmark measures a different thing than the standalone benchmark.

Raw per-commit CSVs in evaluate/results/<repo>_token_efficiency_*.csv.

Impact accuracy (eval/benchmarks/impact_accuracy.py)

13 commits across 6 repos.

| Metric | Value |
|---|---|
| Recall (mean across 13 commits) | 1.000 (100% on every commit) |
| F1 (mean) | 0.714 |
| F1 (median) | 0.667 |
| F1 (min / max) | 0.455 / 1.000 |

The blast-radius analysis over-predicts in some commits (precision ≈ 0.29 in the worst case, where 34 files are flagged for a 10-file change; with recall at 1.0 this is the commit behind the minimum F1 of 0.455). That is intentional: a missed dependency is worse than an extra reviewed file.
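The worst case can be worked through by hand. The sketch below assumes the metrics are computed over sets of file paths, which is an inference from the "34 files flagged for a 10-file change" wording; the file names are placeholders.

```python
def precision_recall_f1(predicted: set[str], actual: set[str]) -> tuple[float, float, float]:
    """File-level precision, recall, and F1 for one commit."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Worst case from the text: 10 changed files, 34 flagged, none missed.
actual = {f"changed_{i}.py" for i in range(10)}
predicted = actual | {f"extra_{i}.py" for i in range(24)}

p, r, f1 = precision_recall_f1(predicted, actual)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With recall fixed at 1.0, F1 reduces to 2p / (1 + p), so the F1 spread in the table is entirely a precision spread.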

Multi-hop retrieval (eval/benchmarks/multi_hop_retrieval.py)

11 hand-curated tasks across the 6 repos. Each task is a 2-step tool chain:

  1. hybrid_search(nl_query, limit=10) looks for a starting anchor node.
  2. query_graph(<traversal_pattern>, target=<anchor>) walks one hop along callers_of / callees_of / tests_for / imports_of / etc.

A task scores 1.0 only if the anchor is found in the top-K and the expected neighbor names are returned by the traversal; otherwise it scores 0.0. A 0.0 therefore covers both "search missed the anchor" and "traversal returned the wrong set". To tell the two apart, inspect anchor_found and neighbor_recall in the per-task CSV row.
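The all-or-nothing scoring rule can be sketched as below. This is an illustration rather than the benchmark's code: `score_task` is a hypothetical name, ranks are assumed to be 0-based (the results table shows rank 0), the anchor match uses the case-insensitive endswith rule from the task schema, and requiring every expected neighbor is an assumption.

```python
def score_task(ranked_names: list[str], anchor_suffix: str,
               traversal_names: set[str], expected: list[str], k: int = 10) -> dict:
    """Score one task: 1.0 only if the anchor is in the top-k and all expected neighbors are returned."""
    suffix = anchor_suffix.lower()
    # First search hit within the top-k whose qualified name ends with the suffix.
    rank = next(
        (i for i, name in enumerate(ranked_names[:k]) if name.lower().endswith(suffix)),
        None,
    )
    anchor_found = rank is not None
    hits = sum(1 for name in expected if name in traversal_names)
    neighbor_recall = hits / len(expected) if (anchor_found and expected) else 0.0
    score = 1.0 if anchor_found and neighbor_recall == 1.0 else 0.0
    return {"anchor_found": anchor_found, "rank": rank,
            "neighbor_recall": neighbor_recall, "score": score}


# Hypothetical search and traversal output for a single task.
result = score_task(
    ranked_names=["pkg/other.py::helper", "pkg/engine.py::Engine.run"],
    anchor_suffix="engine.py::engine.run",
    traversal_names={"main", "cli"},
    expected=["main"],
)
print(result)
```

Keeping anchor_found and neighbor_recall in the returned row is what lets a 0.0 be attributed to the search step or the traversal step afterwards.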

| Repo | Task | Anchor found | Rank | Neighbor recall | Score |
|---|---|---|---|---|---|
| code-review-graph | crg-parse-file-callers | yes | 0 | 1.00 | 1.00 |
| code-review-graph | crg-upsert-node-callers | yes | 4 | 1.00 | 1.00 |
| express | express-create-application-callees | yes | 1 | 1.00 | 1.00 |
| fastapi | fastapi-route-handler-callers | yes | 6 | 1.00 | 1.00 |
| fastapi | fastapi-get-dependant-callers | no | | 0.00 | 0.00 |
| flask | flask-dispatch-callers | yes | 3 | 1.00 | 1.00 |
| flask | flask-exception-callers | yes | 5 | 1.00 | 1.00 |
| gin | gin-serve-http-callees | yes | 5 | 1.00 | 1.00 |
| gin | gin-context-next-callers | yes | 0 | 1.00 | 1.00 |
| httpx | httpx-client-request-callers | yes | 0 | 1.00 | 1.00 |
| httpx | httpx-async-request-tests | yes | 7 | 1.00 | 1.00 |

Average score across 11 tasks: 0.909. 10/11 tasks pass; the one remaining miss (fastapi-get-dependant-callers) targets a function spelled get_dependant ("dependant" with an a) from a query phrased as "dependency declarations into a tree" — there is no lexical overlap and no extractable identifier in the query for the boosting heuristic to lock onto. Left as an honest miss; the fix would be either query rewriting or a richer embedding model.

How the score went from 0.545 to 0.909 (the same-day fix)

The v1 scaffold first scored 0.545 (6/11). Two changes brought it to 0.909 (10/11), both deterministic, both small, both committed in this same session:

  1. embeddings.py:_node_to_text — the embedded text per node used to be just "{name} {kind} in {parent}". It now also includes the dotted form (APIRoute.get_route_handler), the identifier split into words (get route handler), and the enclosing module directory (routing, fastapi, dependencies). All re-embeddings are automatic — the text hash changes, EmbeddingStore.embed_nodes re-embeds. See _split_identifier for the casing/separator rules.

  2. search.py:extract_query_identifiers — natural-language queries like "Who advances the gin middleware chain via Context.Next" now have their dotted / snake_case / CamelCase identifier tokens extracted. Search results whose qualified_name contains any extracted identifier get a 2.0× boost. This pushed Context.Next from rank 11 to rank 0.
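To make the first change concrete, here is a rough sketch of how a richer embedding text could be assembled. It is not the code in embeddings.py: the function names, the field order, the splitting rules, and the use of the parent directory name as the "enclosing module directory" are all assumptions.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def split_identifier(name: str) -> str:
    """Split snake_case / CamelCase into lowercase words."""
    spaced = re.sub(r"[_\-.]+", " ", name)                      # separators
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)     # getRoute -> get Route
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", spaced)   # APIRoute -> API Route
    return " ".join(spaced.lower().split())


def node_to_text(name: str, kind: str, parent: Optional[str], file_path: str) -> str:
    """Assemble the text to embed for one node."""
    parts = [f"{name} {kind}"]
    if parent:
        parts.append(f"in {parent}")
        parts.append(f"{parent}.{name}")                 # dotted form
    parts.append(split_identifier(name))                 # identifier as words
    parts.append(PurePosixPath(file_path).parent.name)   # enclosing directory
    return " ".join(p for p in parts if p)


print(node_to_text("get_route_handler", "Function", "APIRoute", "fastapi/routing.py"))
```

The word-split form is what gives a natural-language query such as "route handler" something to overlap with, since the raw identifier is a single token to most text pipelines.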

The remaining fastapi-get-dependant-callers failure cannot be fixed by either change because the query doesn't share any identifier or substring with the target — that's the boundary of the heuristic.
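The boundary is easy to see with a simplified version of the identifier heuristic. The regular expressions, the case-insensitive containment check, and the scores below are assumptions made for illustration, not the rules in search.py; only the 2.0× factor and the two query strings come from the text above.

```python
import re

_DOTTED = r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+"        # Context.Next
_SNAKE = r"[A-Za-z]\w*_\w+"                         # get_dependant
_CAMEL = r"(?=\w*[a-z])[A-Za-z][a-z0-9]*[A-Z]\w*"   # GraphStore, getRoute
IDENTIFIER_RE = re.compile(rf"\b(?:{_DOTTED}|{_SNAKE}|{_CAMEL})\b")


def extract_identifiers(query: str) -> list[str]:
    """Return identifier-like tokens found in a natural-language query."""
    return IDENTIFIER_RE.findall(query)


def boost(results: list[tuple[str, float]], identifiers: list[str],
          factor: float = 2.0) -> list[tuple[str, float]]:
    """Multiply the score of results whose qualified name contains an identifier, then re-rank."""
    lowered = [ident.lower() for ident in identifiers]
    boosted = [
        (qn, score * factor if any(i in qn.lower() for i in lowered) else score)
        for qn, score in results
    ]
    return sorted(boosted, key=lambda item: item[1], reverse=True)


hit_query = "Who advances the gin middleware chain via Context.Next"
miss_query = "dependency declarations into a tree"
print(extract_identifiers(hit_query))    # one dotted identifier
print(extract_identifiers(miss_query))   # nothing to boost on

# Hypothetical search scores before boosting.
ranked = boost(
    [("context.go::Context.Abort", 0.60), ("context.go::Context.Next", 0.45)],
    extract_identifiers(hit_query),
)
print(ranked[0][0])
```

With no identifier extracted from the second query, the boost step is a no-op and the ranking is left entirely to the embedding and BM25 scores.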

This benchmark is a v1 scaffold (11 tasks). The intent is to track the multi-hop tool chain as the agent's actual usage pattern rather than just single-shot retrieval. Adding more tasks: append multi_hop_tasks: entries to any config under code_review_graph/eval/configs/*.yaml with the schema:

multi_hop_tasks:
  - id: my-task-id                # required, unique
    nl_query: "natural language" # required, what an agent would ask
    anchor_qualified_suffix:     # required, lowercased suffix of expected
      "rel/path.py::owner.symbol" #   qualified_name (case-insensitive endswith)
    traversal_pattern: callers_of # one of callers_of|callees_of|imports_of|
                                  # importers_of|tests_for|inheritors_of|children_of
    expected_neighbor_names:      # required, list of bare names that should
      - "expected_one"            #   appear in the traversal result
    k: 10                         # optional, top-K depth for the search step

Build stats

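| Repo | Nodes | Edges | Flows | Communities | Embeddings | FTS idx rows |
|---|---|---|---|---|---|---|
| fastapi | 6,292 | 32,081 | 165 | 85 | 5,164 | 127 |
| express | 1,912 | 18,877 | 4 | 7 | 1,771 | 47 |
| gin | 1,589 | 17,237 | 114 | 41 | 1,491 | 29 |
| code-review-graph | 1,418 | 8,877 | 104 | 11 | 1,326 | 38 |
| flask | 1,415 | 8,259 | 78 | 13 | 1,329 | 35 |
| httpx | 1,261 | 8,228 | 128 | 5 | 1,193 | 34 |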

Embeddings count is lower than node count because File nodes aren't embedded. FTS idx rows are far lower than node count because FTS5 stores inverted-index segments, not one row per indexed document.
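The FTS5 behavior is easy to observe in isolation. The sketch below builds a throwaway in-memory index and counts rows in SQLite's FTS5 shadow tables; the table name `docs` and the sample text are hypothetical, and which shadow table the "FTS idx rows" column counts in this repo is not stated here, so both are reported.

```python
import sqlite3
from typing import Optional


def fts5_row_counts(doc_count: int = 200) -> Optional[dict]:
    """Index doc_count small documents and count rows in the FTS5 shadow tables."""
    conn = sqlite3.connect(":memory:")
    try:
        try:
            conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
        except sqlite3.OperationalError:
            return None  # this SQLite build lacks FTS5
        conn.executemany(
            "INSERT INTO docs(body) VALUES (?)",
            ((f"function number {i} handles request",) for i in range(doc_count)),
        )
        conn.commit()  # flush the pending index into the shadow tables

        def count(table: str) -> int:
            return conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]

        return {
            "documents": count("docs"),
            "data_rows": count("docs_data"),  # segment pages plus bookkeeping records
            "idx_rows": count("docs_idx"),    # one entry per segment leaf page
        }
    finally:
        conn.close()


print(fts5_row_counts())
```

On a build with FTS5 available, the shadow-table counts come out far below the document count, which is the same pattern as in the build stats table.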

Which benchmark measures what

There are three different "token" benchmarks in the repo. They are all valid but measure different scenarios:

| Benchmark | Naive baseline | Graph cost | Question answered |
|---|---|---|---|
| eval/benchmarks/token_efficiency.py | sum of changed-file content for a specific commit | full get_review_context() JSON | "Is the graph cheaper than just reading the diffed files?" |
| eval/token_benchmark.py | none (absolute per-workflow cost) | sum of 5 MCP-tool responses | "How many tokens does a complete agent workflow cost?" |
| code_review_graph/token_benchmark.py (standalone) | sum of all source files in repo | 5 search hits + 5 neighbor edges per question | "Is the graph cheaper than reading the whole repo?" |

The eval/benchmarks/token_efficiency.py numbers can be less than 1.0× for small commits (get_review_context carries impact-radius metadata and source snippets, which outweigh a tiny changed-file set). The standalone benchmark numbers are always large because the baseline is the entire repo. Pick the one that matches the scenario you're talking about.

Generating diagrams

The 9 diagrams in diagrams/ are produced from diagrams/generate_diagrams.py. Excalidraw source files (.excalidraw) are gitignored (*.excalidraw line in .gitignore); only the rendered PNGs are tracked. Regenerate after a benchmark refresh:

uv run python diagrams/generate_diagrams.py
# Open each .excalidraw at https://excalidraw.com to render/export

Troubleshooting

git clone failed — Network or upstream rate-limit. The fix is a clean retry; the eval doesn't auto-retry by design (loud failures > silent fallback).

git checkout <sha> failed — Upstream rewrote history or removed the SHA. File an issue with the failing config so we can re-pin.

No embeddings found in this graph warning during the standalone benchmark — you skipped Step 3. Run it.

Different community IDs between runs — Make sure you're on the seeded communities.py. Check grep _LEIDEN_SEED code_review_graph/communities.py. You can override the seed via CRG_LEIDEN_SEED=<int> but all collaborators must agree on the same value.

Different naive_corpus_tokens than the canonical table — Make sure git rev-parse HEAD inside each evaluate/test_repos/<name> matches the commit: field in the corresponding config file. If not, delete the clone and let Step 2 re-clone at the pinned SHA.
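The HEAD check can be scripted across all six clones. The sketch below assumes the abbreviated "snapshot SHA" values in the canonical standalone table are the checked-out commits; if your config's commit: field differs, substitute those values. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def sha_matches(head: str, pinned: str) -> bool:
    """True if the full HEAD SHA starts with the (possibly abbreviated) pinned SHA."""
    head, pinned = head.strip().lower(), pinned.strip().lower()
    return bool(pinned) and head.startswith(pinned)


def head_of(repo_dir: Path) -> str:
    """Return the checked-out commit of repo_dir via `git rev-parse HEAD`."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir,
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


# Abbreviated SHAs copied from the canonical standalone table.
PINNED = {
    "fastapi": "0227991a", "code-review-graph": "84bde354", "gin": "5c00df8a",
    "flask": "a29f88ce", "express": "b4ab7d65", "httpx": "b55d4635",
}

root = Path("evaluate/test_repos")
for name, sha in PINNED.items():
    if (root / name / ".git").exists():
        status = "ok" if sha_matches(head_of(root / name), sha) else "MISMATCH"
        print(f"{name:<20}{status}")
```

Any clone reported as MISMATCH is a candidate for deletion and a fresh Step 2 run.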