feat: v1.3.0 — retrieval metrics, judge calibration, online monitoring by lopadova · Pull Request #45 · padosoft/eval-harness

lopadova · 2026-06-16T18:34:35Z

Summary

Implements the 2026-06-16-eval-harness-enhancements plan as a single, additive v1.3.0 release train. The four improvements share MetricResolver, config/eval-harness.php, the service provider, and the README, so per-improvement branches were impractical; the plan explicitly sanctions folding them all into v1.3.0. All v1 contracts are preserved.

Improvement 3 — Ordinal / distance metric

ordinal-distance: partial credit for ordered labels (exact 1.0 / off-by-one 0.5 / further 0.0), per-sample metadata.ordinal_scale override.
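
The scoring rule above can be sketched as follows. This is a minimal illustration, assuming labels are compared by their position on an ordered scale; the function name `ordinalDistanceScore` and the example scale are hypothetical and are not taken from the package's `OrdinalDistanceMetric`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: partial credit for ordered labels.
 * Exact match scores 1.0, an off-by-one label 0.5, anything further 0.0.
 *
 * @param list<string> $scale ordered labels, lowest first
 */
function ordinalDistanceScore(string $expected, string $actual, array $scale): float
{
    $expectedIndex = array_search($expected, $scale, true);
    $actualIndex = array_search($actual, $scale, true);

    // Assumption: a label that is not on the scale earns no credit.
    if ($expectedIndex === false || $actualIndex === false) {
        return 0.0;
    }

    return match (abs($expectedIndex - $actualIndex)) {
        0 => 1.0,
        1 => 0.5,
        default => 0.0,
    };
}

// Example scale; a per-sample override would simply pass a different list.
$scale = ['low', 'medium', 'high', 'critical'];
$score = ordinalDistanceScore('high', 'medium', $scale);
```

Here `$score` is the off-by-one case. How the real metric treats labels missing from the scale is not stated in this PR, so the zero-credit branch is only one plausible choice.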

Improvement 4 — Retrieval-ranking metrics (first-class)

retrieval-hit-at-k, retrieval-recall-at-k, retrieval-mrr, retrieval-ndcg-at-k (binary or graded gains), answer-containment-at-k.
Shared RankedRetrieval parser/value object + AbstractRetrievalRankingMetric base.
metrics.retrieval.default_k config + per-sample metadata.k override. Aliases auto-wire from the container with zero extra binding (RetrievalAliasResolutionTest).
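
For readers unfamiliar with the ranking family, the textbook definitions can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: the function names are hypothetical, the ranked list is assumed to be already deduplicated, and an empty relevant set is assumed to score 0.0. MRR is the mean of the per-sample reciprocal rank shown here.

```php
<?php
declare(strict_types=1);

/** 1.0 if any relevant id appears in the top k, else 0.0. */
function hitAtK(array $ranked, array $relevant, int $k): float
{
    return array_intersect(array_slice($ranked, 0, $k), $relevant) === [] ? 0.0 : 1.0;
}

/** Fraction of the relevant ids that appear in the top k. */
function recallAtK(array $ranked, array $relevant, int $k): float
{
    $relevant = array_unique($relevant);
    if ($relevant === []) {
        return 0.0; // assumption: nothing to recall scores zero
    }
    $found = array_intersect($relevant, array_slice($ranked, 0, $k));

    return (float) (count($found) / count($relevant));
}

/** 1 / rank of the first relevant id, or 0.0 when none is retrieved. */
function reciprocalRank(array $ranked, array $relevant): float
{
    foreach (array_values($ranked) as $index => $id) {
        if (in_array($id, $relevant, true)) {
            return 1.0 / ($index + 1);
        }
    }

    return 0.0;
}

/**
 * nDCG@k with gains keyed by id (1.0 for binary relevance, or graded values).
 *
 * @param array<string, float> $gains
 */
function ndcgAtK(array $ranked, array $gains, int $k): float
{
    $dcg = 0.0;
    foreach (array_slice(array_values($ranked), 0, $k) as $index => $id) {
        $dcg += ($gains[$id] ?? 0.0) / log($index + 2, 2);
    }

    // Ideal ordering: the highest gains first.
    $ideal = array_values($gains);
    rsort($ideal);
    $idcg = 0.0;
    foreach (array_slice($ideal, 0, $k) as $index => $gain) {
        $idcg += $gain / log($index + 2, 2);
    }

    return $idcg > 0.0 ? $dcg / $idcg : 0.0;
}

$ranked = ['d3', 'd1', 'd7', 'd2'];
$relevant = ['d1', 'd2'];
$gains = ['d1' => 1.0, 'd2' => 1.0];
```

With this sample, the first relevant document sits at rank 2, so hit@1 misses while hit@2 succeeds, and recall@2 finds one of the two relevant ids.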

Improvement 2 — Judge calibration

eval-harness:calibrate-judge command: verdict agreement rate, confusion matrix, length-bias signal (Spearman), self-preference guard. Markdown/JSON output, CI gating.
HumanLabel, CalibrationCaseLoader, JudgeCalibrator, JudgeCalibrationReport; calibration.* config.
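
The calibration signals named above can be illustrated with standard formulas. The sketch below is not the `JudgeCalibrator` implementation: all function names are hypothetical, verdicts are assumed to be booleans, and an undefined correlation is assumed to be reported as 0.0. Spearman is computed as the Pearson correlation of tie-averaged ranks.

```php
<?php
declare(strict_types=1);

/** Share of cases where the judge verdict equals the human verdict. */
function agreementRate(array $human, array $judge): float
{
    $n = count($human);
    if ($n === 0 || $n !== count($judge)) {
        throw new InvalidArgumentException('Need two non-empty lists of equal length.');
    }
    $agree = 0;
    foreach ($human as $i => $verdict) {
        $agree += ($verdict === $judge[$i]) ? 1 : 0;
    }

    return (float) ($agree / $n);
}

/** 2x2 confusion matrix with the human label treated as ground truth. */
function confusionMatrix(array $human, array $judge): array
{
    $matrix = ['tp' => 0, 'fp' => 0, 'fn' => 0, 'tn' => 0];
    foreach ($human as $i => $truth) {
        $cell = $truth ? ($judge[$i] ? 'tp' : 'fn') : ($judge[$i] ? 'fp' : 'tn');
        $matrix[$cell]++;
    }

    return $matrix;
}

/** 1-based ranks; tied values share the mean of their positions. */
function averageRanks(array $values): array
{
    $values = array_values($values);
    $order = array_keys($values);
    usort($order, fn (int $a, int $b): int => $values[$a] <=> $values[$b]);
    $ranks = array_fill(0, count($values), 0.0);
    $n = count($order);
    for ($i = 0; $i < $n; $i = $j + 1) {
        $j = $i;
        while ($j + 1 < $n && $values[$order[$j + 1]] == $values[$order[$i]]) {
            $j++;
        }
        for ($m = $i; $m <= $j; $m++) {
            $ranks[$order[$m]] = ($i + $j) / 2 + 1.0;
        }
    }

    return $ranks;
}

/** Spearman rank correlation, e.g. response length against judge score. */
function spearman(array $x, array $y): float
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        return 0.0; // assumption: undefined correlation reported as 0.0
    }
    $rx = averageRanks($x);
    $ry = averageRanks($y);
    $meanX = array_sum($rx) / $n;
    $meanY = array_sum($ry) / $n;
    $cov = $varX = $varY = 0.0;
    foreach ($rx as $i => $rank) {
        $cov += ($rank - $meanX) * ($ry[$i] - $meanY);
        $varX += ($rank - $meanX) ** 2;
        $varY += ($ry[$i] - $meanY) ** 2;
    }
    $denominator = sqrt($varX * $varY);

    return $denominator > 0.0 ? $cov / $denominator : 0.0;
}

$human = [true, true, false, false];
$judge = [true, false, false, true];
$lengths = [120, 450, 80, 900];       // illustrative response lengths
$judgeScores = [0.4, 0.7, 0.2, 0.9];  // illustrative judge scores
```

A strongly positive correlation between `$lengths` and `$judgeScores`, as in this contrived data, is the kind of length-bias signal a CI gate could fail on.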

Improvement 1 — Online / production monitoring (off by default)

OnlineMonitor::capture(), OnlineSamplingDecision, queueable JudgeLiveSampleJob, first package migration (eval_harness_online_scores) + OnlineScore model.
OnlineTrendRepository, OnlineDriftAlert + OnlinePassRateDropped event.
Read-only GET /{prefix}/online/{dataset}/trend endpoint (eval-harness.report-api.v1.online-trend), eval-harness-migrations publish tag, online.* config.
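
As a rough picture of what a drift alert over a recent window might check, consider the sketch below. It is an assumption-laden illustration rather than the `OnlineDriftAlert` logic: the function name, the `$minSamples` guard, and the returned array shape are all hypothetical; only the idea of comparing a recent pass rate against a configured threshold comes from this PR.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical drift check over a recent window of pass/fail outcomes.
 *
 * @param list<bool> $recentOutcomes most recent judged samples
 * @return array{pass_rate: ?float, dropped: bool}
 */
function checkDrift(array $recentOutcomes, float $threshold, int $minSamples): array
{
    $total = count($recentOutcomes);

    // Assumption: too few samples means "no signal", never an alert.
    if ($total === 0 || $total < $minSamples) {
        return ['pass_rate' => null, 'dropped' => false];
    }

    $passRate = (float) (count(array_filter($recentOutcomes)) / $total);

    return ['pass_rate' => $passRate, 'dropped' => $passRate < $threshold];
}

$result = checkDrift([true, false, false, false], 0.7, 4);
```

In a host app, a `dropped` result is the point where an event such as `OnlinePassRateDropped` would be dispatched for alert routing.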

Shared

RuntimeOptions::normalizeUnitInterval() clamp helper for [0,1] fractions.
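
A clamp helper of this kind might look like the sketch below. The name `clampUnitInterval`, its signature, and its fallback behavior for non-numeric input are assumptions made for illustration; the actual `normalizeUnitInterval()` signature is not shown in this PR.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical clamp: coerce a config value into the [0, 1] interval.
 * Non-numeric or NaN input falls back to the supplied default.
 */
function clampUnitInterval(mixed $value, float $default = 0.0): float
{
    if (!is_numeric($value)) {
        return $default;
    }

    $fraction = (float) $value;
    if (is_nan($fraction)) {
        return $default;
    }

    return max(0.0, min(1.0, $fraction));
}

$rate = clampUnitInterval('0.25'); // numeric strings, e.g. from env, are accepted
```

Clamping instead of throwing keeps a mistyped sampling rate from breaking a production request path, at the cost of silently correcting bad config.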

Tests / gates (local)

composer validate --strict — valid.
vendor/bin/phpunit — OK (863 tests, 2335 assertions) Unit + (3 tests, 783 assertions) Architecture. (4 PHP 8.5 framework-internal deprecations, pre-existing.)
vendor/bin/phpstan analyse --memory-limit=512M — no errors.
vendor/bin/pint --test — passed.
Opt-in tests/Live/LiveOnlineMonitorTest.php self-skips without EVAL_HARNESS_LIVE_API_KEY.

Docs

README updated across all sections (features, metrics, API endpoints, three new usage sections, configuration, registry, roadmap).
REPORT_API_CONTRACT.md (online trend), CHANGELOG.md (v1.3.0), docs/PROGRESS.md, docs/LESSON.md.

Follow-ups

Tag v1.3.0 after merge (current latest v1.2.0); the AskMyDocs delegation plan pins ^1.3.0.
Companion eval-harness-admin needs the Online Monitoring screen consuming the new endpoint — separate repo/PR.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ring

Improvement 3: ordinal-distance metric (partial credit for ordered labels).
Improvement 4: retrieval-ranking family (hit@k, recall@k, MRR, nDCG@k, answer-containment@k) on a shared RankedRetrieval parser + base, with metrics.retrieval.default_k config and per-sample metadata.k override.
Improvement 2: eval-harness:calibrate-judge command validating the LLM judge against human labels (verdict agreement, length-bias, self-preference).
Improvement 1: online production monitoring (OnlineMonitor::capture, queued JudgeLiveSampleJob, eval_harness_online_scores migration + model, trend repository, drift alert event, read-only online/{dataset}/trend endpoint).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 409a6cb014

chatgpt-codex-connector · 2026-06-16T18:40:38Z

+        if (! $this->writeOrPrintReport($payload)) {
+            return self::FAILURE;
+        }
+
+        return $this->finalize($report, $config);


Keep JSON output parseable before emitting diagnostics

When --json is used without --out, this writes the JSON report to stdout and then calls finalize(), which can append warn()/error() text for length bias, self-preference, or low agreement. In those failure/warning scenarios, CI or other callers that parse the advertised machine-readable JSON output receive extra non-JSON text after the object; run the gating diagnostics before printing, or send them to a separate stream / file path.
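
One of the options mentioned above, sending diagnostics to a separate stream, can be sketched like this. The function name `emitReport` and the payload keys are hypothetical; the streams are passed in as parameters so the example is testable, and a CLI script would pass `STDOUT` and `STDERR`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical emitter: JSON goes to one stream, human diagnostics to another,
 * so a caller parsing stdout only ever sees a single JSON document.
 *
 * @param array<string, mixed> $payload
 * @param list<string> $diagnostics
 * @param resource $stdout
 * @param resource $stderr
 */
function emitReport(array $payload, array $diagnostics, $stdout, $stderr): void
{
    fwrite($stdout, json_encode($payload, JSON_THROW_ON_ERROR | JSON_PRETTY_PRINT) . PHP_EOL);

    foreach ($diagnostics as $line) {
        fwrite($stderr, $line . PHP_EOL);
    }
}

// In-memory streams stand in for STDOUT / STDERR in this demonstration.
$out = fopen('php://memory', 'w+');
$err = fopen('php://memory', 'w+');

emitReport(
    ['agreement_rate' => 0.62, 'passed' => false], // illustrative payload keys
    ['Agreement 0.62 is below the gate.'],
    $out,
    $err
);

rewind($out);
$decoded = json_decode(stream_get_contents($out), true, 512, JSON_THROW_ON_ERROR);
```

Because nothing but the JSON object reaches the first stream, `json_decode` succeeds even when the gate fails and diagnostics are emitted.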

chatgpt-codex-connector · 2026-06-16T18:40:38Z

+        // these rows are projections, not hydrated models.
+        $rows = OnlineScore::forDataset($dataset)
+            ->toBase()
+            ->selectRaw('DATE(judged_at) as day_bucket, count(*) as total_count, sum(passed) as passed_count')


Use a portable boolean aggregate for trends

In PostgreSQL-backed Laravel apps, passed is a boolean column and sum(passed) is not a valid aggregate, so the new online trend endpoint fails as soon as it queries stored scores. This works on SQLite/MySQL because booleans coerce to integers, but the package should use a portable expression such as a CASE WHEN passed THEN 1 ELSE 0 END sum for the passed_count.
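
The suggested expression can be tried in isolation with plain PDO. This sketch assumes the `pdo_sqlite` extension is available and uses an in-memory SQLite database with a simplified, assumed table schema; SQLite accepts both forms, so it only demonstrates that the `CASE` version yields the expected counts, not the PostgreSQL failure itself.

```php
<?php
declare(strict_types=1);

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Simplified stand-in for the package table (column types are assumptions).
$pdo->exec('CREATE TABLE eval_harness_online_scores (dataset TEXT, passed BOOLEAN, judged_at TEXT)');

$insert = $pdo->prepare('INSERT INTO eval_harness_online_scores VALUES (?, ?, ?)');
foreach ([
    ['support-bot', 1, '2026-06-15 10:00:00'],
    ['support-bot', 0, '2026-06-15 11:00:00'],
    ['support-bot', 1, '2026-06-16 09:00:00'],
] as $row) {
    $insert->execute($row);
}

// CASE WHEN turns the boolean into an integer before summing, which is
// valid on PostgreSQL as well as on MySQL and SQLite.
$query = $pdo->prepare(
    'SELECT DATE(judged_at) AS day_bucket,
            COUNT(*) AS total_count,
            SUM(CASE WHEN passed THEN 1 ELSE 0 END) AS passed_count
       FROM eval_harness_online_scores
      WHERE dataset = ?
      GROUP BY DATE(judged_at)
      ORDER BY day_bucket ASC'
);
$query->execute(['support-bot']);
$rows = $query->fetchAll(PDO::FETCH_ASSOC);
```

Each row in `$rows` is a per-day projection; dividing `passed_count` by `total_count` gives the pass rate for that day.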

chatgpt-codex-connector · 2026-06-16T18:40:38Z

+ * @property array<string, mixed>|null $details
+ * @property Carbon $judged_at
+ */
+final class OnlineScore extends Model


Declare the database component for the new model

This new Eloquent model makes online monitoring depend on illuminate/database, but composer.json still declares only the other Illuminate components. In installs that rely on this package's component requirements rather than a full laravel/framework dependency, resolving OnlineScore or the online trend path will fail with a missing Illuminate\Database\Eloquent\Model; add the database component as a runtime dependency or avoid Eloquent for the optional store.

Copilot

Pull request overview

This PR delivers the v1.3.0 additive release train for the eval-harness Laravel package, adding new scoring metrics (ordinal + retrieval-ranking), judge calibration tooling, and an opt-in online monitoring subsystem with a read-only trend API and first package migration.

Changes:

Added retrieval-ranking metrics (hit@k, recall@k, MRR, nDCG@k, answer containment) backed by a shared RankedRetrieval parser and configurable default_k.
Added judge calibration pipeline (eval-harness:calibrate-judge) including strict YAML case loading, calibration reporting, and CI gating signals (agreement/length-bias/self-preference).
Added online monitoring (sampling gate, queued judging job, drift alert event, persistence + trend API endpoint) plus RuntimeOptions::normalizeUnitInterval().

Reviewed changes

Copilot reviewed 56 out of 56 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/Unit/Support/RuntimeOptionsTest.php | Adds coverage for new unit-interval normalization behavior. |
| tests/Unit/ReportApi/Online/OnlineTrendControllerTest.php | Verifies online trend API envelope, aggregation, and path traversal handling. |
| tests/Unit/Online/OnlineTrendRepositoryTest.php | Tests per-day pass-rate aggregation and limiting behavior. |
| tests/Unit/Online/OnlineScoreModelTest.php | Validates model table name, casts, persistence, and dataset filtering helper. |
| tests/Unit/Online/OnlineSamplingDecisionTest.php | Tests config-driven sampling gate behavior with deterministic randomizer. |
| tests/Unit/Online/OnlineMonitorTest.php | Ensures capture dispatch behavior respects enablement and sampling rate. |
| tests/Unit/Online/OnlineDriftAlertTest.php | Tests drift detection logic and event dispatch thresholds. |
| tests/Unit/Online/JudgeLiveSampleJobTest.php | Validates queued job persists scores and pass/fail thresholding without provider calls. |
| tests/Unit/Metrics/RetrievalRecallAtKMetricTest.php | Covers recall@k scoring behavior. |
| tests/Unit/Metrics/RetrievalNdcgAtKMetricTest.php | Covers nDCG@k scoring for binary and graded relevance. |
| tests/Unit/Metrics/RetrievalMrrMetricTest.php | Covers MRR scoring behavior. |
| tests/Unit/Metrics/RetrievalHitAtKMetricTest.php | Covers hit@k scoring plus metadata k override behavior. |
| tests/Unit/Metrics/RetrievalAliasResolutionTest.php | Ensures retrieval metric aliases resolve via container without extra bindings. |
| tests/Unit/Metrics/Retrieval/RankedRetrievalTest.php | Tests retrieval JSON parsing, deduping, and expected-output gain handling. |
| tests/Unit/Metrics/OrdinalDistanceMetricTest.php | Covers ordinal-distance scoring and metadata scale override. |
| tests/Unit/Metrics/AnswerContainmentAtKMetricTest.php | Covers answer containment scoring over top-k retrieved texts. |
| tests/Unit/Console/CalibrateJudgeCommandTest.php | Ensures calibrate-judge command gates correctly and emits JSON output. |
| tests/Unit/Calibration/JudgeCalibratorTest.php | Validates agreement, confusion matrix, length bias correlation, and self-preference guard. |
| tests/Unit/Calibration/CalibrationCaseLoaderTest.php | Tests strict-schema calibration YAML validation and error cases. |
| tests/Live/LiveOnlineMonitorTest.php | Adds opt-in live test for end-to-end online capture against a real judge. |
| tests/Fixtures/calibration/judge-cases.v1.yaml | Adds calibration fixture YAML used by command/unit tests. |
| src/Support/RuntimeOptions.php | Adds `normalizeUnitInterval()` for safe [0,1] fraction parsing/clamping. |
| src/ReportApi/ReportApiSchema.php | Adds schema discriminator for the online trend endpoint. |
| src/ReportApi/Online/OnlineTrendResource.php | Adds JSON resource shaping for online trend responses. |
| src/ReportApi/Online/OnlineTrendController.php | Implements read-only online trend endpoint, limit parsing, and dataset validation. |
| src/Online/OnlineTrendRepository.php | Adds SQL aggregation for per-day online pass-rate points. |
| src/Online/OnlineScore.php | Introduces persisted online score model and typed dataset query starter. |
| src/Online/OnlineSamplingDecision.php | Adds config-driven sampling gate with injectable randomizer. |
| src/Online/OnlineMonitor.php | Adds host-app entrypoint to capture and dispatch online judging jobs. |
| src/Online/OnlineDriftAlert.php | Adds drift detection over a recent window and dispatches alert event. |
| src/Online/JudgeLiveSampleJob.php | Adds queued job to evaluate a sampled interaction, persist score, and re-check drift. |
| src/Online/Events/OnlinePassRateDropped.php | Adds drift event contract for host-app alert routing. |
| src/Metrics/RetrievalRecallAtKMetric.php | Implements recall@k metric. |
| src/Metrics/RetrievalNdcgAtKMetric.php | Implements nDCG@k metric. |
| src/Metrics/RetrievalMrrMetric.php | Implements MRR metric. |
| src/Metrics/RetrievalHitAtKMetric.php | Implements hit@k metric. |
| src/Metrics/Retrieval/RankedRetrieval.php | Adds shared retrieval parser/value object for ranked ids/texts. |
| src/Metrics/Retrieval/AbstractRetrievalRankingMetric.php | Adds shared base resolving k and parsing ranked + gains for retrieval metrics. |
| src/Metrics/OrdinalDistanceMetric.php | Implements ordinal-distance metric with per-sample scale override. |
| src/Metrics/MetricResolver.php | Extends built-in alias map to include new metrics. |
| src/Metrics/Metric.php | Updates metric interface doc to list new built-ins. |
| src/Metrics/AnswerContainmentAtKMetric.php | Adds answer containment@k metric over retrieved texts. |
| src/EvalHarnessServiceProvider.php | Registers calibration/online services, loads migrations, publishes migrations, and wires command. |
| src/Console/CalibrateJudgeCommand.php | Adds calibrate-judge command and report output handling. |
| src/Calibration/JudgeCalibrator.php | Implements calibration run logic, strict JSON decoding, and Spearman correlation. |
| src/Calibration/JudgeCalibrationReport.php | Adds immutable calibration report value object + array export. |
| src/Calibration/HumanLabel.php | Adds value object for human-labelled calibration cases. |
| src/Calibration/CalibrationCaseLoader.php | Adds strict-schema YAML loader for calibration cases. |
| routes/eval-harness-api.php | Adds the new online trend endpoint route. |
| README.md | Documents new metrics, calibration command, online monitoring, config blocks, and API endpoint. |
| docs/REPORT_API_CONTRACT.md | Documents the online trend API contract and payload shape. |
| docs/PROGRESS.md | Records implementation progress and local gate results for v1.3.0 work. |
| docs/LESSON.md | Captures implementation lessons around aggregates, PHPStan constraints, and migrations. |
| database/migrations/2026_06_16_000000_create_eval_harness_online_scores_table.php | Adds first package migration for online score persistence. |
| config/eval-harness.php | Adds retrieval, calibration, and online configuration blocks. |
| CHANGELOG.md | Adds v1.3.0 release notes entry. |

+    private function random(): float
+    {
+        if ($this->randomizer !== null) {
+            return ($this->randomizer)();
+        }
+
+        return mt_rand() / mt_getrandmax();
+    }


+- Aggregates `eval_harness_online_scores` rows into per-day
+  (UTC date) pass-rate points, ascending by date. `threshold` echoes
+  `eval-harness.online.alert.threshold` so a dashboard can draw the
+  alert band.


- Declare illuminate/database (Eloquent OnlineScore model + migration).
- Portable boolean aggregate: SUM(CASE WHEN passed THEN 1 ELSE 0 END) so the online trend works on PostgreSQL, not just MySQL/SQLite.
- Drop redundant single-column dataset/judged_at indexes; the composite (dataset, judged_at) index covers every hot query.
- Sampling draw stays in [0, 1): divide by (mt_getrandmax() + 1).
- calibrate-judge: keep --json stdout parseable by suppressing the human diagnostic lines when JSON is written to stdout (data is in the payload).
- Soften the online-trend contract wording to not promise UTC bucketing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
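
The half-open sampling draw mentioned in this commit can be sketched as below. The function names and the `draw < rate` comparison are assumptions about how a sampling gate might consume the draw; only the `mt_getrandmax() + 1` divisor comes from the commit message.

```php
<?php
declare(strict_types=1);

/**
 * Uniform draw in [0, 1). Dividing by mt_getrandmax() alone can return
 * exactly 1.0; dividing by mt_getrandmax() + 1 cannot.
 */
function unitDraw(): float
{
    return (float) (mt_rand() / (mt_getrandmax() + 1));
}

/**
 * Hypothetical gate: sample when the draw falls below the configured rate.
 * An injectable randomizer keeps the decision deterministic in tests.
 */
function shouldSample(float $rate, ?callable $randomizer = null): bool
{
    $draw = $randomizer !== null ? (float) $randomizer() : unitDraw();

    return $draw < $rate;
}

$always = shouldSample(1.0); // a draw in [0, 1) is always below 1.0
$never = shouldSample(0.0);  // no draw is below 0.0
```

Under this comparison, the half-open interval is what makes a rate of 1.0 mean "every request" and a rate of 0.0 mean "none", with no edge case at either end.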

Copilot

Pull request overview

Copilot reviewed 57 out of 57 changed files in this pull request and generated 2 comments.

+            // Single composite index covers every hot query
+            // (forDataset()->orderByDesc('judged_at') and the grouped
+            // trend aggregate); standalone dataset/judged_at indexes
+            // would be redundant write amplification.
+            $table->index(['dataset', 'judged_at']);


+## [1.3.0] - 2026-06-16
+
+Additive, backward-compatible feature set. All v1 contracts are preserved.
+
+### Added


- Add id to the online-scores composite index ((dataset, judged_at, id)) so the drift-alert ORDER BY judged_at DESC, id DESC tie-break is covered.
- Move the 1.3.0 CHANGELOG entry below [Unreleased] per Keep a Changelog; add the 1.3.0 link reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 57 out of 57 changed files in this pull request and generated 1 comment.

+        $needle = $this->normalize($sample->expectedOutput);
+
+        $matchedRank = null;
+        foreach ($ranked->topKTexts($k) as $index => $text) {
+            $haystack = $this->normalize($text);
+            if ($haystack !== '' && str_contains($haystack, $needle)) {
+                $matchedRank = $index + 1;
+                break;
+            }
+        }
+
+        return new MetricScore($matchedRank !== null ? 1.0 : 0.0, [
+            'k' => $k,
+            'expected_span' => $sample->expectedOutput,
+            'matched_rank' => $matchedRank,
+        ]);
+    }
+
+    private function normalize(string $value): string
+    {
+        $collapsed = preg_replace('/\s+/u', ' ', $value) ?? $value;
+        $collapsed = trim($collapsed);
+
+        return $this->caseSensitive ? $collapsed : mb_strtolower($collapsed, 'UTF-8');
+    }


lopadova and others added 4 commits June 16, 2026 20:29

feat(support): add RuntimeOptions::normalizeUnitInterval clamp helper

2a9aa92

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: document v1.3.0 retrieval metrics, calibration and online monit…

1edbe80

…oring Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(calibration): add calibrate-judge command test + fixture

409a6cb

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova requested a review from Copilot June 16, 2026 18:36

Copilot started reviewing on behalf of lopadova June 16, 2026 18:36

chatgpt-codex-connector Bot reviewed Jun 16, 2026

Copilot AI reviewed Jun 16, 2026

lopadova mentioned this pull request Jun 16, 2026

feat: add online monitoring screen (pass-rate trend + drift alert) padosoft/eval-harness-admin#6

Merged

lopadova requested a review from Copilot June 16, 2026 19:48

Copilot started reviewing on behalf of lopadova June 16, 2026 19:48

Copilot AI reviewed Jun 16, 2026

lopadova requested a review from Copilot June 16, 2026 20:04

Copilot started reviewing on behalf of lopadova June 16, 2026 20:05

Copilot AI reviewed Jun 16, 2026

lopadova merged commit 5332e17 into main Jun 16, 2026
7 checks passed

lopadova merged 6 commits into main from feat/eval-harness-v1.3.0-enhancements

2 participants
