feat: v1.3.0 — retrieval metrics, judge calibration, online monitoring#45
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ring
Improvement 3: ordinal-distance metric (partial credit for ordered labels).
Improvement 4: retrieval-ranking family (hit@k, recall@k, MRR, nDCG@k,
answer-containment@k) on a shared RankedRetrieval parser + base, with
metrics.retrieval.default_k config and per-sample metadata.k override.
Improvement 2: eval-harness:calibrate-judge command validating the LLM
judge against human labels (verdict agreement, length-bias, self-preference).
Improvement 1: online production monitoring (OnlineMonitor::capture, queued
JudgeLiveSampleJob, eval_harness_online_scores migration + model, trend
repository, drift alert event, read-only online/{dataset}/trend endpoint).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oring Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 409a6cb014
    if (! $this->writeOrPrintReport($payload)) {
        return self::FAILURE;
    }

    return $this->finalize($report, $config);
Keep JSON output parseable before emitting diagnostics
When --json is used without --out, this writes the JSON report to stdout and then calls finalize(), which can append warn()/error() text for length bias, self-preference, or low agreement. In those failure/warning scenarios, CI or other callers that parse the advertised machine-readable JSON output receive extra non-JSON text after the object; run the gating diagnostics before printing, or send them to a separate stream / file path.
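One way to act on this suggestion is to keep the two streams separate. The sketch below is not the package's `CalibrateJudgeCommand`: `emitReport()` and the payload keys are hypothetical names used only for illustration. It writes the JSON object to one stream and the human-readable diagnostics to another, so a caller that parses stdout only ever sees JSON.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the package): write the machine-readable
 * report to $out and any gating diagnostics to $err.
 *
 * @param array<string, mixed> $payload
 * @param list<string>         $diagnostics
 * @param resource             $out
 * @param resource             $err
 */
function emitReport(array $payload, array $diagnostics, $out, $err): void
{
    fwrite($out, json_encode($payload, JSON_THROW_ON_ERROR | JSON_PRETTY_PRINT) . PHP_EOL);

    foreach ($diagnostics as $line) {
        fwrite($err, '[warn] ' . $line . PHP_EOL);
    }
}

// In-memory streams stand in for STDOUT / STDERR so the split is easy to inspect.
$out = fopen('php://memory', 'w+');
$err = fopen('php://memory', 'w+');

emitReport(
    ['agreement_rate' => 0.72, 'passed' => false], // illustrative values only
    ['agreement rate below configured minimum'],
    $out,
    $err
);

rewind($out);
$decoded = json_decode(stream_get_contents($out), true, 512, JSON_THROW_ON_ERROR);

rewind($err);
$stderrText = stream_get_contents($err);
```

In a real command the two streams would be `STDOUT` and `STDERR`. The follow-up commit in this PR took the other route mentioned in the comment and suppresses the diagnostic lines when JSON goes to stdout.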
    // these rows are projections, not hydrated models.
    $rows = OnlineScore::forDataset($dataset)
        ->toBase()
        ->selectRaw('DATE(judged_at) as day_bucket, count(*) as total_count, sum(passed) as passed_count')
Use a portable boolean aggregate for trends
In PostgreSQL-backed Laravel apps, passed is a boolean column and sum(passed) is not a valid aggregate, so the new online trend endpoint fails as soon as it queries stored scores. This works on SQLite/MySQL because booleans coerce to integers, but the package should use a portable expression such as SUM(CASE WHEN passed THEN 1 ELSE 0 END) for passed_count.
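The portable aggregate can be tried outside Laravel. The following is a minimal sketch, assuming the `pdo_sqlite` extension is available; the `online_scores` table and its three columns are a simplified stand-in for illustration, not the package's actual migration.

```php
<?php

declare(strict_types=1);

// Assumption: a simplified stand-in table, not the package's real schema.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE online_scores (dataset TEXT, passed BOOLEAN, judged_at TEXT)');

$insert = $pdo->prepare('INSERT INTO online_scores (dataset, passed, judged_at) VALUES (?, ?, ?)');
foreach ([
    ['support', 1, '2026-06-15 09:00:00'],
    ['support', 0, '2026-06-15 10:00:00'],
    ['support', 1, '2026-06-16 09:00:00'],
] as $row) {
    $insert->execute($row);
}

// CASE WHEN ... END yields an integer per row, so SUM() never sees a boolean.
$sql = <<<'SQL'
SELECT DATE(judged_at) AS day_bucket,
       COUNT(*) AS total_count,
       SUM(CASE WHEN passed THEN 1 ELSE 0 END) AS passed_count
FROM online_scores
WHERE dataset = ?
GROUP BY DATE(judged_at)
ORDER BY day_bucket
SQL;

$stmt = $pdo->prepare($sql);
$stmt->execute(['support']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

The same `SUM(CASE WHEN ... END)` expression is accepted by PostgreSQL, MySQL and SQLite, which is why it is the usual replacement for summing a boolean column directly.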
     * @property array<string, mixed>|null $details
     * @property Carbon $judged_at
     */
    final class OnlineScore extends Model
Declare the database component for the new model
This new Eloquent model makes online monitoring depend on illuminate/database, but composer.json still declares only the other Illuminate components. In installs that rely on this package's component requirements rather than a full laravel/framework dependency, resolving OnlineScore or the online trend path will fail with a missing Illuminate\Database\Eloquent\Model; add the database component as a runtime dependency or avoid Eloquent for the optional store.
Pull request overview
This PR delivers the v1.3.0 additive release train for the eval-harness Laravel package, adding new scoring metrics (ordinal + retrieval-ranking), judge calibration tooling, and an opt-in online monitoring subsystem with a read-only trend API and first package migration.
Changes:
- Added retrieval-ranking metrics (hit@k, recall@k, MRR, nDCG@k, answer containment) backed by a shared `RankedRetrieval` parser and configurable `default_k`.
- Added judge calibration pipeline (`eval-harness:calibrate-judge`) including strict YAML case loading, calibration reporting, and CI gating signals (agreement/length-bias/self-preference).
- Added online monitoring (sampling gate, queued judging job, drift alert event, persistence + trend API endpoint) plus `RuntimeOptions::normalizeUnitInterval()`.
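For readers unfamiliar with the retrieval-ranking family named above, here is a minimal sketch of the textbook definitions. The function names are hypothetical and this is not the package's `RankedRetrieval` parser or its metric classes; it assumes the ranked list is already deduplicated and ordered best first.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative helpers only (hypothetical names).
 *
 * @param list<string> $ranked   retrieved ids, best first, no duplicates
 * @param list<string> $relevant ids judged relevant
 */
function hitAtK(array $ranked, array $relevant, int $k): float
{
    return array_intersect(array_slice($ranked, 0, $k), $relevant) === [] ? 0.0 : 1.0;
}

function recallAtK(array $ranked, array $relevant, int $k): float
{
    $unique = array_unique($relevant);
    if ($unique === []) {
        return 0.0; // assumption: nothing relevant means nothing to recall
    }

    return count(array_intersect($unique, array_slice($ranked, 0, $k))) / count($unique);
}

function mrr(array $ranked, array $relevant): float
{
    foreach (array_values($ranked) as $index => $id) {
        if (in_array($id, $relevant, true)) {
            return 1.0 / ($index + 1); // reciprocal of the first relevant rank
        }
    }

    return 0.0;
}

/** @param array<string, float> $gains id => graded relevance (1.0 for binary) */
function ndcgAtK(array $ranked, array $gains, int $k): float
{
    $dcg = 0.0;
    foreach (array_slice($ranked, 0, $k) as $index => $id) {
        $dcg += ($gains[$id] ?? 0.0) / log($index + 2, 2);
    }

    // Ideal ordering: the highest gains placed first.
    $ideal = array_values($gains);
    rsort($ideal);
    $idcg = 0.0;
    foreach (array_slice($ideal, 0, $k) as $index => $gain) {
        $idcg += $gain / log($index + 2, 2);
    }

    return $idcg > 0.0 ? $dcg / $idcg : 0.0;
}

$ranked = ['d3', 'd1', 'd7', 'd2'];
$relevant = ['d1', 'd2'];

$hit = hitAtK($ranked, $relevant, 3);
$recall = recallAtK($ranked, $relevant, 3);
$reciprocal = mrr($ranked, $relevant);
$ndcg = ndcgAtK($ranked, ['d1' => 1.0, 'd2' => 1.0], 3);
```

With this input only `d1` appears in the top three, at rank 2, so hit@3 is 1.0, recall@3 is 0.5 and MRR is 0.5, while nDCG@3 is penalised both for the missing `d2` and for `d1` not being ranked first.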
Reviewed changes
Copilot reviewed 56 out of 56 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/Unit/Support/RuntimeOptionsTest.php | Adds coverage for new unit-interval normalization behavior. |
| tests/Unit/ReportApi/Online/OnlineTrendControllerTest.php | Verifies online trend API envelope, aggregation, and path traversal handling. |
| tests/Unit/Online/OnlineTrendRepositoryTest.php | Tests per-day pass-rate aggregation and limiting behavior. |
| tests/Unit/Online/OnlineScoreModelTest.php | Validates model table name, casts, persistence, and dataset filtering helper. |
| tests/Unit/Online/OnlineSamplingDecisionTest.php | Tests config-driven sampling gate behavior with deterministic randomizer. |
| tests/Unit/Online/OnlineMonitorTest.php | Ensures capture dispatch behavior respects enablement and sampling rate. |
| tests/Unit/Online/OnlineDriftAlertTest.php | Tests drift detection logic and event dispatch thresholds. |
| tests/Unit/Online/JudgeLiveSampleJobTest.php | Validates queued job persists scores and pass/fail thresholding without provider calls. |
| tests/Unit/Metrics/RetrievalRecallAtKMetricTest.php | Covers recall@k scoring behavior. |
| tests/Unit/Metrics/RetrievalNdcgAtKMetricTest.php | Covers nDCG@k scoring for binary and graded relevance. |
| tests/Unit/Metrics/RetrievalMrrMetricTest.php | Covers MRR scoring behavior. |
| tests/Unit/Metrics/RetrievalHitAtKMetricTest.php | Covers hit@k scoring plus metadata k override behavior. |
| tests/Unit/Metrics/RetrievalAliasResolutionTest.php | Ensures retrieval metric aliases resolve via container without extra bindings. |
| tests/Unit/Metrics/Retrieval/RankedRetrievalTest.php | Tests retrieval JSON parsing, deduping, and expected-output gain handling. |
| tests/Unit/Metrics/OrdinalDistanceMetricTest.php | Covers ordinal-distance scoring and metadata scale override. |
| tests/Unit/Metrics/AnswerContainmentAtKMetricTest.php | Covers answer containment scoring over top-k retrieved texts. |
| tests/Unit/Console/CalibrateJudgeCommandTest.php | Ensures calibrate-judge command gates correctly and emits JSON output. |
| tests/Unit/Calibration/JudgeCalibratorTest.php | Validates agreement, confusion matrix, length bias correlation, and self-preference guard. |
| tests/Unit/Calibration/CalibrationCaseLoaderTest.php | Tests strict-schema calibration YAML validation and error cases. |
| tests/Live/LiveOnlineMonitorTest.php | Adds opt-in live test for end-to-end online capture against a real judge. |
| tests/Fixtures/calibration/judge-cases.v1.yaml | Adds calibration fixture YAML used by command/unit tests. |
| src/Support/RuntimeOptions.php | Adds normalizeUnitInterval() for safe [0,1] fraction parsing/clamping. |
| src/ReportApi/ReportApiSchema.php | Adds schema discriminator for the online trend endpoint. |
| src/ReportApi/Online/OnlineTrendResource.php | Adds JSON resource shaping for online trend responses. |
| src/ReportApi/Online/OnlineTrendController.php | Implements read-only online trend endpoint, limit parsing, and dataset validation. |
| src/Online/OnlineTrendRepository.php | Adds SQL aggregation for per-day online pass-rate points. |
| src/Online/OnlineScore.php | Introduces persisted online score model and typed dataset query starter. |
| src/Online/OnlineSamplingDecision.php | Adds config-driven sampling gate with injectable randomizer. |
| src/Online/OnlineMonitor.php | Adds host-app entrypoint to capture and dispatch online judging jobs. |
| src/Online/OnlineDriftAlert.php | Adds drift detection over a recent window and dispatches alert event. |
| src/Online/JudgeLiveSampleJob.php | Adds queued job to evaluate a sampled interaction, persist score, and re-check drift. |
| src/Online/Events/OnlinePassRateDropped.php | Adds drift event contract for host-app alert routing. |
| src/Metrics/RetrievalRecallAtKMetric.php | Implements recall@k metric. |
| src/Metrics/RetrievalNdcgAtKMetric.php | Implements nDCG@k metric. |
| src/Metrics/RetrievalMrrMetric.php | Implements MRR metric. |
| src/Metrics/RetrievalHitAtKMetric.php | Implements hit@k metric. |
| src/Metrics/Retrieval/RankedRetrieval.php | Adds shared retrieval parser/value object for ranked ids/texts. |
| src/Metrics/Retrieval/AbstractRetrievalRankingMetric.php | Adds shared base resolving k and parsing ranked + gains for retrieval metrics. |
| src/Metrics/OrdinalDistanceMetric.php | Implements ordinal-distance metric with per-sample scale override. |
| src/Metrics/MetricResolver.php | Extends built-in alias map to include new metrics. |
| src/Metrics/Metric.php | Updates metric interface doc to list new built-ins. |
| src/Metrics/AnswerContainmentAtKMetric.php | Adds answer containment@k metric over retrieved texts. |
| src/EvalHarnessServiceProvider.php | Registers calibration/online services, loads migrations, publishes migrations, and wires command. |
| src/Console/CalibrateJudgeCommand.php | Adds calibrate-judge command and report output handling. |
| src/Calibration/JudgeCalibrator.php | Implements calibration run logic, strict JSON decoding, and Spearman correlation. |
| src/Calibration/JudgeCalibrationReport.php | Adds immutable calibration report value object + array export. |
| src/Calibration/HumanLabel.php | Adds value object for human-labelled calibration cases. |
| src/Calibration/CalibrationCaseLoader.php | Adds strict-schema YAML loader for calibration cases. |
| routes/eval-harness-api.php | Adds the new online trend endpoint route. |
| README.md | Documents new metrics, calibration command, online monitoring, config blocks, and API endpoint. |
| docs/REPORT_API_CONTRACT.md | Documents the online trend API contract and payload shape. |
| docs/PROGRESS.md | Records implementation progress and local gate results for v1.3.0 work. |
| docs/LESSON.md | Captures implementation lessons around aggregates, PHPStan constraints, and migrations. |
| database/migrations/2026_06_16_000000_create_eval_harness_online_scores_table.php | Adds first package migration for online score persistence. |
| config/eval-harness.php | Adds retrieval, calibration, and online configuration blocks. |
| CHANGELOG.md | Adds v1.3.0 release notes entry. |
    private function random(): float
    {
        if ($this->randomizer !== null) {
            return ($this->randomizer)();
        }

        return mt_rand() / mt_getrandmax();
    }
- Aggregates `eval_harness_online_scores` rows into per-day
  (UTC date) pass-rate points, ascending by date. `threshold` echoes
  `eval-harness.online.alert.threshold` so a dashboard can draw the
  alert band.
- Declare illuminate/database (Eloquent OnlineScore model + migration).
- Portable boolean aggregate: SUM(CASE WHEN passed THEN 1 ELSE 0 END) so the online trend works on PostgreSQL, not just MySQL/SQLite.
- Drop redundant single-column dataset/judged_at indexes; the composite (dataset, judged_at) index covers every hot query.
- Sampling draw stays in [0, 1): divide by (mt_getrandmax() + 1).
- calibrate-judge: keep --json stdout parseable by suppressing the human diagnostic lines when JSON is written to stdout (data is in the payload).
- Soften the online-trend contract wording to not promise UTC bucketing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
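The sampling-draw fix in this commit comes down to one divisor. `mt_rand()` can return `mt_getrandmax()` itself, so dividing by `mt_getrandmax()` can yield exactly 1.0. The sketch below uses hypothetical function names and assumes the gate compares the draw with `<`; the package's `OnlineSamplingDecision` is not reproduced here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the sampling gate's random draw.
 * An injected $randomizer is expected to return a float in [0, 1).
 */
function drawUnit(?callable $randomizer = null): float
{
    if ($randomizer !== null) {
        return $randomizer();
    }

    // mt_rand() is inclusive of mt_getrandmax(), so dividing by max + 1
    // keeps the result strictly below 1.0.
    return mt_rand() / (mt_getrandmax() + 1);
}

/** Assumption: a sample is taken when the draw falls below the configured rate. */
function shouldSample(float $rate, ?callable $randomizer = null): bool
{
    return drawUnit($randomizer) < $rate;
}

// A deterministic randomizer makes the gate testable without real randomness.
$never = shouldSample(0.0, static fn (): float => 0.0);
$always = shouldSample(1.0, static fn (): float => 0.999999);
```

Under that `<` assumption, a draw of exactly 1.0 would skip a sample even at a rate of 1.0, which is the edge case the half-open interval removes.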
    // Single composite index covers every hot query
    // (forDataset()->orderByDesc('judged_at') and the grouped
    // trend aggregate); standalone dataset/judged_at indexes
    // would be redundant write amplification.
    $table->index(['dataset', 'judged_at']);
    ## [1.3.0] - 2026-06-16

    Additive, backward-compatible feature set. All v1 contracts are preserved.

    ### Added
- Add id to the online-scores composite index ((dataset, judged_at, id)) so the drift-alert ORDER BY judged_at DESC, id DESC tie-break is covered.
- Move the 1.3.0 CHANGELOG entry below [Unreleased] per Keep a Changelog; add the 1.3.0 link reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
        $needle = $this->normalize($sample->expectedOutput);

        $matchedRank = null;
        foreach ($ranked->topKTexts($k) as $index => $text) {
            $haystack = $this->normalize($text);
            if ($haystack !== '' && str_contains($haystack, $needle)) {
                $matchedRank = $index + 1;
                break;
            }
        }

        return new MetricScore($matchedRank !== null ? 1.0 : 0.0, [
            'k' => $k,
            'expected_span' => $sample->expectedOutput,
            'matched_rank' => $matchedRank,
        ]);
    }

    private function normalize(string $value): string
    {
        $collapsed = preg_replace('/\s+/u', ' ', $value) ?? $value;
        $collapsed = trim($collapsed);

        return $this->caseSensitive ? $collapsed : mb_strtolower($collapsed, 'UTF-8');
    }
Summary
Implements the `2026-06-16-eval-harness-enhancements` plan as a single, additive v1.3.0 release train (the four improvements share `MetricResolver`, `config/eval-harness.php`, the service provider and README, so per-improvement branches were impractical; the plan explicitly sanctions folding all into v1.3.0). All v1 contracts are preserved.

Improvement 3: Ordinal / distance metric
- `ordinal-distance`: partial credit for ordered labels (exact 1.0 / off-by-one 0.5 / further 0.0), per-sample `metadata.ordinal_scale` override.

Improvement 4: Retrieval-ranking metrics (first-class)
- `retrieval-hit-at-k`, `retrieval-recall-at-k`, `retrieval-mrr`, `retrieval-ndcg-at-k` (binary or graded gains), `answer-containment-at-k`.
- `RankedRetrieval` parser/value object + `AbstractRetrievalRankingMetric` base.
- `metrics.retrieval.default_k` config + per-sample `metadata.k` override. Aliases auto-wire from the container with zero extra binding (`RetrievalAliasResolutionTest`).

Improvement 2: Judge calibration
- `eval-harness:calibrate-judge` command: verdict agreement rate, confusion matrix, length-bias signal (Spearman), self-preference guard. Markdown/JSON output, CI gating.
- `HumanLabel`, `CalibrationCaseLoader`, `JudgeCalibrator`, `JudgeCalibrationReport`; `calibration.*` config.

Improvement 1: Online / production monitoring (off by default)
- `OnlineMonitor::capture()`, `OnlineSamplingDecision`, queueable `JudgeLiveSampleJob`, first package migration (`eval_harness_online_scores`) + `OnlineScore` model.
- `OnlineTrendRepository`, `OnlineDriftAlert` + `OnlinePassRateDropped` event.
- `GET /{prefix}/online/{dataset}/trend` endpoint (`eval-harness.report-api.v1.online-trend`), `eval-harness-migrations` publish tag, `online.*` config.

Shared
- `RuntimeOptions::normalizeUnitInterval()` clamp helper for `[0,1]` fractions.

Tests / gates (local)
- `composer validate --strict`: valid.
- `vendor/bin/phpunit`: `OK (863 tests, 2335 assertions)` Unit + `(3 tests, 783 assertions)` Architecture. (4 PHP 8.5 framework-internal deprecations, pre-existing.)
- `vendor/bin/phpstan analyse --memory-limit=512M`: no errors.
- `vendor/bin/pint --test`: passed.
- `tests/Live/LiveOnlineMonitorTest.php` self-skips without `EVAL_HARNESS_LIVE_API_KEY`.

Docs
- `REPORT_API_CONTRACT.md` (online trend), `CHANGELOG.md` (v1.3.0), `docs/PROGRESS.md`, `docs/LESSON.md`.

Follow-ups
- `v1.3.0` after merge (current latest `v1.2.0`); the AskMyDocs delegation plan pins `^1.3.0`.
- `eval-harness-admin` needs the Online Monitoring screen consuming the new endpoint; separate repo/PR.

🤖 Generated with Claude Code
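The ordinal-distance rule from the summary (exact 1.0, off-by-one 0.5, further 0.0) can be sketched as follows. This is an illustration under stated assumptions, not the package's `OrdinalDistanceMetric`: the function name is hypothetical, the scale is passed in directly rather than read from `metadata.ordinal_scale`, and labels outside the scale are assumed to score 0.0.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: partial credit by distance on an ordered label scale.
 *
 * @param list<string> $scale ordered labels, lowest first
 */
function ordinalDistanceScore(array $scale, string $expected, string $actual): float
{
    $expectedPos = array_search($expected, $scale, true);
    $actualPos = array_search($actual, $scale, true);

    if ($expectedPos === false || $actualPos === false) {
        return 0.0; // assumption: labels outside the scale earn no credit
    }

    // Distance in scale positions decides the credit.
    return match (abs($expectedPos - $actualPos)) {
        0 => 1.0,
        1 => 0.5,
        default => 0.0,
    };
}

$scale = ['low', 'medium', 'high', 'critical'];

$exact = ordinalDistanceScore($scale, 'medium', 'medium');
$adjacent = ordinalDistanceScore($scale, 'medium', 'high');
$far = ordinalDistanceScore($scale, 'low', 'critical');
```

Compared with exact-match scoring, a prediction of `high` for an expected `medium` keeps half credit instead of counting as a full miss.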