Skip to content

fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229#235

Open
dgenio wants to merge 2 commits into
mainfrom
claude/pr-229-followup-cherry-picks
Open

fix(benchmarks,scripts): rescue 2 Copilot fixes from duplicate PR #229#235
dgenio wants to merge 2 commits into
mainfrom
claude/pr-229-followup-cherry-picks

Conversation

@dgenio
Copy link
Copy Markdown
Owner

@dgenio dgenio commented May 17, 2026

Summary

Re-applies the two Copilot-suggested fixes from PR #229's review-fix commit that remained applicable on top of main after the duplicate PR #228 merged. The other four Copilot-suggested fixes from #229 turned out to already be correct on main via #228's implementation.

PR #229 itself is a structural duplicate of merged PR #228 (both branches independently implemented the same #207-#215 issue group). A direct rebase surfaces 12+ files in deep conflict because the same logical work was implemented differently on each branch. Recommend closing #229 as a duplicate (see the comment posted there).

Changes

1. benchmarks/benchmark.py--backends typo validation (PR #229 Copilot #5)

Typos like --backends tfidf,tifdf previously reached Router(scorer_backend=) and surfaced as a ConfigError traceback. Now _parse_args validates against _SUPPORTED_BACKENDS = {tfidf, bm25, fuzzy} and exits via parser.error() with argparse's standard code-2 convention.

Tests added in tests/test_benchmark.py:

  • test_parse_args_rejects_unknown_backend — code-2 exit on typo
  • test_parse_args_accepts_all_supported_backends
  • test_parse_args_accepts_subset_of_backends

2. scripts/benchmark_delta.py — skipped-row rendering (PR #229 Copilot #4)

When benchmarks/benchmark.py --matrix emits a (backend, size) cell with status: "skipped: rapidfuzz not installed" (zeroed metrics by design), the previous delta script treated those zeroes as a real accuracy/latency regression and emitted ⚠️ markers on every PR. Now _render_matrix_section short-circuits on a non-empty status and emits a single _skipped_ (<reason>) row.

Tests added in new tests/test_benchmark_delta.py:

  • test_matrix_delta_renders_skipped_row_for_status_bearing_cell — pins no ⚠️ on skipped cells
  • test_matrix_delta_real_cells_still_render_metrics — status-less cells unchanged
  • test_matrix_delta_empty_payload_returns_empty_string — empty-input safety

What did NOT need re-applying (already correct on main via #228)

PR #229 Copilot # Concern Already-on-main status
#8 int casts on naive_tokens / cw_tokens scripts/baseline_naive.py:141-142 already emits ints
#7 dead if default_row is not None and not default_row.is_default: pass branch Main's sweep_scoring.py never had it
#2 / #3 matrix/per_namespace location in README + CHANGELOG Main's benchmarks/README.md:136 already says "additive top-level keys"
#1 _DELIM_CHARS missing underscore Was a false positive in PR #229 (acknowledged in the original review-fix commit)
#6 sweep_scoring.py size Acknowledged in PR #229 as following the existing scripts/ precedent

Verification

ruff format src/ tests/ examples/ scripts/   →  clean
ruff check  src/ tests/ examples/ scripts/   →  All checks passed
mypy src/                                    →  Success: 0 issues / 72 source files
pytest -q                                    →  1011 passed, 5 skipped (+ 6 new regression tests)
make example                                 →  clean
make demo                                    →  Demo complete
make scorecard-check                         →  clean

Related


Generated by Claude Code

PR #229 was a duplicate of merged PR #228 — both branches independently
implemented the same #207-#215 issue group, making a direct rebase
unworkable (12+ files in deep conflict). This commit re-applies the two
of the six Copilot-suggested fixes from PR #229's review-fix commit that
remained applicable on top of main's #228-merged implementation; the
other four (int casts in baseline_naive, dead branch in sweep_scoring,
top-level matrix/per_namespace doc location in README + CHANGELOG) are
already correct on main.

1. **benchmark.py --backends validation** (PR #229 Copilot #5). Typos
   like `--backends tfidf,tifdf` previously reached `Router(scorer_backend=)`
   and surfaced as a `ConfigError` traceback. Now `_parse_args` validates
   against `_SUPPORTED_BACKENDS = {tfidf, bm25, fuzzy}` and exits via
   `parser.error()` with argparse's standard code-2 convention.
   Tests: `test_parse_args_rejects_unknown_backend`,
   `test_parse_args_accepts_all_supported_backends`,
   `test_parse_args_accepts_subset_of_backends`.

2. **benchmark_delta.py skipped-row rendering** (PR #229 Copilot #4).
   When `benchmarks/benchmark.py --matrix` emits a `(backend, size)` cell
   with `status: "skipped: rapidfuzz not installed"` (zeroed metrics by
   design), the previous delta script treated those zeroes as a real
   accuracy/latency regression and rendered ⚠️ markers on every PR.
   Now `_render_matrix_section` short-circuits on a non-empty `status`
   and emits a single `_skipped_ (<reason>)` row. Tests covering both
   the skipped-row path and the unchanged status-less path live in the
   new `tests/test_benchmark_delta.py`.

This is the surviving piece of PR #229's PR-review-cycle work. PR #229
itself should be closed as a structural duplicate; this commit lands on
a fresh branch (`claude/pr-229-followup-cherry-picks`) off current main.

Verification:
- ruff format/lint clean
- mypy strict — 0 issues / 72 source files
- pytest -q — 1011 passed, 5 skipped
- make example + make demo clean
- make scorecard-check clean
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Re-applies two fixes from a duplicate PR: (1) validate benchmarks/benchmark.py --backends early via argparse conventions, and (2) adjust scripts/benchmark_delta.py matrix rendering to avoid treating “skipped” cells as regressions, with accompanying regression tests.

Changes:

  • Add _SUPPORTED_BACKENDS validation in benchmarks/benchmark.py::_parse_args() to fail fast on unknown backend names via parser.error() (exit code 2).
  • Update scripts/benchmark_delta.py::_render_matrix_section() to short-circuit rendering when a matrix cell carries a status.
  • Add/extend tests for _parse_args() and the matrix-delta “skipped cell” behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
benchmarks/benchmark.py Adds supported-backend allowlist and argparse-level validation for --backends.
scripts/benchmark_delta.py Attempts to suppress false-positive regressions by treating status-bearing cells as skipped.
tests/test_benchmark.py Adds regression tests for _parse_args() backend validation behavior.
tests/test_benchmark_delta.py Adds regression tests for skipped-cell rendering in matrix delta output.

Comment on lines +139 to +148
# Skip-cells carry zeroed metrics by design (e.g. fuzzy with no
# rapidfuzz). Treating them as accuracy/latency regressions would
# produce false-positive ⚠️ markers on every PR; emit a single
# "skipped" row that surfaces the reason text instead.
head_status = str(h.get("status", ""))
base_status = str(b.get("status", ""))
if head_status or base_status:
reason = head_status or base_status
lines.append(f"| {backend} | {size} | _skipped_ ({reason}) | — | — |")
continue
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed and fixed in fad6462. You were exactly right — MatrixCell.status: str = "ok" (benchmark.py:184) so my if head_status or base_status: gate fired on every real cell. The github-actions benchmark-delta bot caught it live on this PR: every (backend, size) cell rendered as _skipped_ (ok) instead of its metrics.

The gate is now status != "ok", aligned with the existing convention in scripts/render_scorecard.py:150-151. Missing-status cells are treated as "ok" for backwards compat.

On the "availability improvements" suggestion (base-skipped/head-ok): I kept the simpler "either side is non-ok → render as skipped" treatment for now. A head-only metric without a base would show as recall@k 0.45 (+0.45) which is misleading (not a real delta) — surfacing the more nuanced rendering is a worthwhile but separate improvement. Happy to follow up if you'd like that in this PR.

Resolving thread.


Generated by Claude Code

Comment thread tests/test_benchmark_delta.py Outdated
Comment on lines +55 to +83
def test_matrix_delta_real_cells_still_render_metrics() -> None:
"""Status-less cells still produce a full recall/MRR/latency row."""
base = _payload(
[
{
"backend": "tfidf",
"catalog_size": 100,
"recall_at_k": 0.5,
"mrr": 0.4,
"latency_ms_p99": 1.0,
}
]
)
head = _payload(
[
{
"backend": "tfidf",
"catalog_size": 100,
"recall_at_k": 0.5,
"mrr": 0.4,
"latency_ms_p99": 1.0,
}
]
)
out = _render_matrix_section(base, head)
# Numeric markers present, no skipped marker, no regression marker.
assert "_skipped_" not in out
assert "0.5000" in out

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fad6462. The fixtures now explicitly carry status: "ok":

  • test_matrix_delta_ok_cells_render_metrics (renamed from the misleading _real_cells_still_render_metrics) — sets status: "ok" on both base and head and asserts metrics render (0.5000 present, no _skipped_). This is exactly the regression test that would have caught my original bug.
  • test_matrix_delta_missing_status_treated_as_ok — backwards-compat for cells with no status field at all.

Verified locally: pytest -q now reports 1012 passed (one more test than before the fix).

Resolving thread.


Generated by Claude Code

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 17, 2026

Benchmark delta (vs main)

Soft regression feedback only — this comment never blocks the PR.
Latency budget: ⚠️ when head > base × 1.3. Accuracy budget: ⚠️ when head < base - 1pp.

Routing summary (single backend × catalog sizes)

size recall@k (head Δ vs base) MRR (head Δ vs base) p99 (ms)
50 ✅ 0.5649 (+0.0000) ✅ 0.4978 (+0.0000) ✅ 0.499 (base 0.463)
83 ✅ 0.3825 (+0.0000) ✅ 0.3242 (+0.0000) ⚠️ 1.179 (base 0.876)
1000 ✅ 0.1475 (+0.0000) ✅ 0.1456 (+0.0000) ✅ 39.815 (base 31.897)

Per-backend × per-size matrix

backend size recall@k (Δ) MRR (Δ) p99 (ms)
bm25 100 ✅ 0.3825 (+0.0000) ✅ 0.3399 (+0.0000) ✅ 5.896 (base 5.642)
bm25 500 ✅ 0.2250 (+0.0000) ✅ 0.2165 (+0.0000) ✅ 27.958 (base 27.538)
bm25 1000 ✅ 0.1575 (+0.0000) ✅ 0.1525 (+0.0000) ✅ 86.709 (base 78.368)
fuzzy 100 skipped (skipped: missing rapidfuzz)
fuzzy 500 skipped (skipped: missing rapidfuzz)
fuzzy 1000 skipped (skipped: missing rapidfuzz)
tfidf 100 ✅ 0.3825 (+0.0000) ✅ 0.3220 (+0.0000) ✅ 0.991 (base 0.872)
tfidf 500 ✅ 0.2325 (+0.0000) ✅ 0.2314 (+0.0000) ✅ 8.804 (base 8.660)
tfidf 1000 ✅ 0.1475 (+0.0000) ✅ 0.1456 (+0.0000) ✅ 33.536 (base 30.071)

Context pipeline (per scenario)

scenario tokens dropped dedup
large_catalog 1514 (base 1514, Δ+0) 0 (base 0, Δ+0) 0 (base 0, Δ+0)
long_conversation 2548 (base 2548, Δ+0) 0 (base 0, Δ+0) 0 (base 0, Δ+0)
short_conversation 496 (base 496, Δ+0) 0 (base 0, Δ+0) 0 (base 0, Δ+0)
stress_conversation 6651 (base 6651, Δ+0) 7 (base 7, Δ+0) 4 (base 4, Δ+0)

Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.

Addresses both Copilot inline comments on PR #235.

The previous fix used `if head_status or base_status:` to detect skipped
matrix cells, but `benchmarks/benchmark.py::MatrixCell.status` defaults
to `"ok"` for every real cell — making `"ok"` truthy and causing the
delta script to render `_skipped_ (ok)` for every backend/size cell.
The benchmark-delta bot comment on this PR caught the bug live: all 9
real cells (bm25/fuzzy/tfidf × 100/500/1000) were marked as skipped.

The gate is now `status != "ok"`, mirroring the existing convention in
`scripts/render_scorecard.py:150-151`. Cells with no `status` field are
treated as "ok" for backwards compat.

Test updates:
- `test_matrix_delta_ok_cells_render_metrics` (renamed from
  `test_matrix_delta_real_cells_still_render_metrics`) now explicitly
  sets `status: "ok"` on the fixtures and asserts metrics render — this
  is the regression test that would have caught the original bug.
- `test_matrix_delta_missing_status_treated_as_ok` added for backwards
  compatibility on cells that omit `status` entirely.

Verification:
- ruff format/lint/mypy clean
- pytest -q: 1012 passed, 5 skipped (one more test than before the fix)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants