feat: v1.3.0 — retrieval metrics, judge calibration, online monitoring#45

Merged: lopadova merged 6 commits into main from feat/eval-harness-v1.3.0-enhancements on Jun 16, 2026
@lopadova

Contributor

Summary

Implements the 2026-06-16-eval-harness-enhancements plan as a single, additive v1.3.0 release train (the four improvements share MetricResolver, config/eval-harness.php, the service provider and README, so per-improvement branches were impractical; the plan explicitly sanctions folding all into v1.3.0). All v1 contracts are preserved.

Improvement 3 — Ordinal / distance metric

  • ordinal-distance: partial credit for ordered labels (exact 1.0 / off-by-one 0.5 / further 0.0), per-sample metadata.ordinal_scale override.
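
The partial-credit rule above can be sketched as follows. This is an illustration only, not the package's OrdinalDistanceMetric: the function name `ordinalDistanceScore`, the example scale, and the handling of labels missing from the scale are all assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: scores a predicted label by its distance from the
 * expected label on an ordered scale (exact 1.0, off-by-one 0.5, further 0.0).
 *
 * @param list<string> $scale ordered labels, lowest to highest
 */
function ordinalDistanceScore(string $expected, string $actual, array $scale): float
{
    $expectedIndex = array_search($expected, $scale, true);
    $actualIndex = array_search($actual, $scale, true);

    // Assumption: a label that is not on the scale earns no credit.
    if ($expectedIndex === false || $actualIndex === false) {
        return 0.0;
    }

    return match (abs($expectedIndex - $actualIndex)) {
        0 => 1.0,
        1 => 0.5,
        default => 0.0,
    };
}
```

With a scale such as `['low', 'medium', 'high', 'critical']`, predicting `high` when `medium` was expected would earn 0.5, while predicting `critical` would earn 0.0.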

Improvement 4 — Retrieval-ranking metrics (first-class)

  • retrieval-hit-at-k, retrieval-recall-at-k, retrieval-mrr, retrieval-ndcg-at-k (binary or graded gains), answer-containment-at-k.
  • Shared RankedRetrieval parser/value object + AbstractRetrievalRankingMetric base.
  • metrics.retrieval.default_k config + per-sample metadata.k override. Aliases auto-wire from the container with zero extra binding (RetrievalAliasResolutionTest).
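
For readers unfamiliar with the ranking metrics listed above, here is a minimal sketch of the textbook definitions. The function names are hypothetical and this is not the package's implementation. It assumes the ranked list is already deduplicated, that MRR is computed per sample as a reciprocal rank over the whole list, and that nDCG uses linear gains with a log2 discount; the PR does not state which nDCG variant the package uses.

```php
<?php
declare(strict_types=1);

/** @param list<string> $ranked @param list<string> $relevant */
function hitAtK(array $ranked, array $relevant, int $k): float
{
    // 1.0 if any relevant id appears in the top k, else 0.0.
    return array_intersect(array_slice($ranked, 0, $k), $relevant) === [] ? 0.0 : 1.0;
}

/** @param list<string> $ranked @param list<string> $relevant */
function recallAtK(array $ranked, array $relevant, int $k): float
{
    $unique = array_unique($relevant);
    if ($unique === []) {
        return 0.0; // Assumption: no relevant ids means nothing to recall.
    }

    $found = array_intersect($unique, array_slice($ranked, 0, $k));

    return count($found) / count($unique);
}

/** @param list<string> $ranked @param list<string> $relevant */
function reciprocalRank(array $ranked, array $relevant): float
{
    foreach (array_values($ranked) as $index => $id) {
        if (in_array($id, $relevant, true)) {
            return 1.0 / ($index + 1); // rank is 1-based
        }
    }

    return 0.0;
}

/**
 * @param list<string>         $ranked
 * @param array<string, float> $gains  id => gain (1.0 for binary relevance)
 */
function ndcgAtK(array $ranked, array $gains, int $k): float
{
    $dcg = 0.0;
    foreach (array_slice(array_values($ranked), 0, $k) as $index => $id) {
        $dcg += ($gains[$id] ?? 0.0) / log($index + 2, 2);
    }

    // Ideal DCG: the same gains sorted from best to worst.
    $ideal = array_values($gains);
    rsort($ideal);
    $idcg = 0.0;
    foreach (array_slice($ideal, 0, $k) as $index => $gain) {
        $idcg += $gain / log($index + 2, 2);
    }

    return $idcg > 0.0 ? $dcg / $idcg : 0.0;
}
```

Passing graded gains (for example `['doc-a' => 3.0, 'doc-b' => 1.0]`) instead of all-ones is what distinguishes graded from binary nDCG in this sketch.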

Improvement 2 — Judge calibration

  • eval-harness:calibrate-judge command: verdict agreement rate, confusion matrix, length-bias signal (Spearman), self-preference guard. Markdown/JSON output, CI gating.
  • HumanLabel, CalibrationCaseLoader, JudgeCalibrator, JudgeCalibrationReport; calibration.* config.
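
Two of the calibration signals named above, verdict agreement and a Spearman correlation, can be sketched like this. The code is not taken from JudgeCalibrator. The function names are hypothetical, and it is an assumption that the length-bias signal correlates response length with judge score. Spearman is computed here as the Pearson correlation of average ranks, which handles ties.

```php
<?php
declare(strict_types=1);

/** @param list<bool> $human @param list<bool> $judge */
function agreementRate(array $human, array $judge): float
{
    $n = count($human);
    if ($n === 0 || $n !== count($judge)) {
        throw new InvalidArgumentException('Expected two non-empty lists of equal length.');
    }

    $agree = 0;
    foreach ($human as $i => $verdict) {
        $agree += $verdict === $judge[$i] ? 1 : 0;
    }

    return $agree / $n;
}

/** @param list<int|float> $values @return list<float> 1-based ranks, ties share the mean rank */
function averageRanks(array $values): array
{
    $order = array_keys($values);
    usort($order, fn (int $a, int $b): int => $values[$a] <=> $values[$b]);

    $n = count($order);
    $ranks = array_fill(0, $n, 0.0);
    for ($i = 0; $i < $n; $i = $j) {
        $j = $i;
        while ($j < $n && $values[$order[$j]] == $values[$order[$i]]) {
            $j++;
        }
        $shared = ($i + 1 + $j) / 2.0; // mean of ranks i+1 .. j
        for ($t = $i; $t < $j; $t++) {
            $ranks[$order[$t]] = $shared;
        }
    }

    return $ranks;
}

/** @param list<int|float> $x @param list<int|float> $y */
function spearman(array $x, array $y): float
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        throw new InvalidArgumentException('Expected two lists of equal length >= 2.');
    }

    $rx = averageRanks($x);
    $ry = averageRanks($y);
    $mx = array_sum($rx) / $n;
    $my = array_sum($ry) / $n;

    $cov = $vx = $vy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $rx[$i] - $mx;
        $dy = $ry[$i] - $my;
        $cov += $dx * $dy;
        $vx += $dx * $dx;
        $vy += $dy * $dy;
    }

    // Assumption: a constant series has no defined correlation; report 0.0.
    return ($vx == 0.0 || $vy == 0.0) ? 0.0 : $cov / sqrt($vx * $vy);
}
```

Under that assumption, a strongly positive `spearman($responseLengths, $judgeScores)` would suggest the judge rewards longer answers regardless of quality.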

Improvement 1 — Online / production monitoring (off by default)

  • OnlineMonitor::capture(), OnlineSamplingDecision, queueable JudgeLiveSampleJob, first package migration (eval_harness_online_scores) + OnlineScore model.
  • OnlineTrendRepository, OnlineDriftAlert + OnlinePassRateDropped event.
  • Read-only GET /{prefix}/online/{dataset}/trend endpoint (eval-harness.report-api.v1.online-trend), eval-harness-migrations publish tag, online.* config.

Shared

  • RuntimeOptions::normalizeUnitInterval() clamp helper for [0,1] fractions.
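
A clamp helper of this kind might look like the sketch below. The real signature of RuntimeOptions::normalizeUnitInterval() is not shown in this PR, so the function name, the `mixed` input, and the fallback default are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for a [0,1] clamp helper: parses a config value
 * and clamps it into the unit interval, falling back to $default when the
 * value is not numeric.
 */
function clampToUnitInterval(mixed $value, float $default = 0.0): float
{
    if (! is_numeric($value)) {
        return $default;
    }

    $fraction = (float) $value;
    if (is_nan($fraction)) {
        return $default;
    }

    return max(0.0, min(1.0, $fraction));
}
```

Values such as a sampling rate or an alert threshold read from config or environment variables often arrive as strings, which is why the sketch accepts `mixed` rather than `float`.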

Tests / gates (local)

  • composer validate --strict — valid.
  • vendor/bin/phpunit: OK (863 tests, 2335 assertions) Unit + (3 tests, 783 assertions) Architecture. (4 PHP 8.5 framework-internal deprecations, pre-existing.)
  • vendor/bin/phpstan analyse --memory-limit=512M — no errors.
  • vendor/bin/pint --test — passed.
  • Opt-in tests/Live/LiveOnlineMonitorTest.php self-skips without EVAL_HARNESS_LIVE_API_KEY.

Docs

  • README updated across all sections (features, metrics, API endpoints, three new usage sections, configuration, registry, roadmap).
  • REPORT_API_CONTRACT.md (online trend), CHANGELOG.md (v1.3.0), docs/PROGRESS.md, docs/LESSON.md.

Follow-ups

  • Tag v1.3.0 after merge (current latest v1.2.0); the AskMyDocs delegation plan pins ^1.3.0.
  • Companion eval-harness-admin needs the Online Monitoring screen consuming the new endpoint — separate repo/PR.

🤖 Generated with Claude Code

lopadova and others added 4 commits June 16, 2026 20:29
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ring

Improvement 3: ordinal-distance metric (partial credit for ordered labels).
Improvement 4: retrieval-ranking family (hit@k, recall@k, MRR, nDCG@k,
  answer-containment@k) on a shared RankedRetrieval parser + base, with
  metrics.retrieval.default_k config and per-sample metadata.k override.
Improvement 2: eval-harness:calibrate-judge command validating the LLM
  judge against human labels (verdict agreement, length-bias, self-preference).
Improvement 1: online production monitoring (OnlineMonitor::capture, queued
  JudgeLiveSampleJob, eval_harness_online_scores migration + model, trend
  repository, drift alert event, read-only online/{dataset}/trend endpoint).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oring

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 409a6cb014

Comment thread src/Console/CalibrateJudgeCommand.php Outdated
Comment on lines +74 to +78
if (! $this->writeOrPrintReport($payload)) {
    return self::FAILURE;
}

return $this->finalize($report, $config);

[P2] Keep JSON output parseable before emitting diagnostics

When --json is used without --out, this writes the JSON report to stdout and then calls finalize(), which can append warn()/error() text for length bias, self-preference, or low agreement. In those failure/warning scenarios, CI or other callers that parse the advertised machine-readable JSON output receive extra non-JSON text after the object; run the gating diagnostics before printing, or send them to a separate stream / file path.
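
One of the options mentioned above, sending diagnostics to a separate stream, can be sketched as follows. This is an illustration rather than the command's code: `emitReport`, the payload keys in the usage note, and the injected stream arguments are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: the machine-readable report goes to one stream and
 * human-facing diagnostics go to another, so a JSON consumer never sees
 * trailing non-JSON text.
 *
 * @param array<string, mixed> $payload
 * @param list<string>         $diagnostics
 * @param resource             $out         e.g. STDOUT
 * @param resource             $err         e.g. STDERR
 */
function emitReport(array $payload, array $diagnostics, $out, $err): void
{
    fwrite($out, json_encode($payload, JSON_THROW_ON_ERROR | JSON_PRETTY_PRINT) . PHP_EOL);

    foreach ($diagnostics as $line) {
        fwrite($err, $line . PHP_EOL);
    }
}
```

In a CLI script this would be called as `emitReport($payload, $warnings, STDOUT, STDERR)`, so that piping stdout into a JSON parser keeps working even when warnings are raised.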


Comment thread src/Online/OnlineTrendRepository.php Outdated
// these rows are projections, not hydrated models.
$rows = OnlineScore::forDataset($dataset)
    ->toBase()
    ->selectRaw('DATE(judged_at) as day_bucket, count(*) as total_count, sum(passed) as passed_count')

[P2] Use a portable boolean aggregate for trends

In PostgreSQL-backed Laravel apps, passed is a boolean column and sum(passed) is not a valid aggregate, so the new online trend endpoint fails as soon as it queries stored scores. This works on SQLite/MySQL because booleans coerce to integers, but the package should use a portable expression such as a CASE WHEN passed THEN 1 ELSE 0 END sum for the passed_count.
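
The suggested expression, and what it computes, can be sketched as below. The select string mirrors the column aliases in the quoted line; `trendSelect` and `passRateByDay` are hypothetical names, and the pure-PHP aggregation is only a reference for the result the SQL is expected to produce, not the repository's code.

```php
<?php
declare(strict_types=1);

// Portable form suggested by the review: a CASE expression instead of
// summing a boolean column directly.
const PASSED_COUNT_SQL = 'SUM(CASE WHEN passed THEN 1 ELSE 0 END)';

function trendSelect(): string
{
    return 'DATE(judged_at) AS day_bucket, COUNT(*) AS total_count, '
        . PASSED_COUNT_SQL . ' AS passed_count';
}

/**
 * Pure-PHP reference of the same aggregation, grouped by day.
 *
 * @param list<array{day: string, passed: bool}> $rows
 * @return array<string, array{total: int, passed: int, pass_rate: float}>
 */
function passRateByDay(array $rows): array
{
    $buckets = [];
    foreach ($rows as $row) {
        $day = $row['day'];
        $buckets[$day] ??= ['total' => 0, 'passed' => 0, 'pass_rate' => 0.0];
        $buckets[$day]['total']++;
        $buckets[$day]['passed'] += $row['passed'] ? 1 : 0; // CASE WHEN passed THEN 1 ELSE 0 END
    }

    foreach ($buckets as $day => $bucket) {
        $buckets[$day]['pass_rate'] = $bucket['passed'] / (float) $bucket['total'];
    }

    ksort($buckets); // ISO dates sort chronologically as strings

    return $buckets;
}
```

With a query builder that accepts raw select fragments, `trendSelect()` would replace the `sum(passed)` string in the quoted call.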


* @property array<string, mixed>|null $details
* @property Carbon $judged_at
*/
final class OnlineScore extends Model

[P2] Declare the database component for the new model

This new Eloquent model makes online monitoring depend on illuminate/database, but composer.json still declares only the other Illuminate components. In installs that rely on this package's component requirements rather than a full laravel/framework dependency, resolving OnlineScore or the online trend path will fail with a missing Illuminate\Database\Eloquent\Model; add the database component as a runtime dependency or avoid Eloquent for the optional store.


Copilot AI left a comment

Pull request overview

This PR delivers the v1.3.0 additive release train for the eval-harness Laravel package, adding new scoring metrics (ordinal + retrieval-ranking), judge calibration tooling, and an opt-in online monitoring subsystem with a read-only trend API and first package migration.

Changes:

  • Added retrieval-ranking metrics (hit@k, recall@k, MRR, nDCG@k, answer containment) backed by a shared RankedRetrieval parser and configurable default_k.
  • Added judge calibration pipeline (eval-harness:calibrate-judge) including strict YAML case loading, calibration reporting, and CI gating signals (agreement/length-bias/self-preference).
  • Added online monitoring (sampling gate, queued judging job, drift alert event, persistence + trend API endpoint) plus RuntimeOptions::normalizeUnitInterval().

Reviewed changes

Copilot reviewed 56 out of 56 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `tests/Unit/Support/RuntimeOptionsTest.php` | Adds coverage for new unit-interval normalization behavior. |
| `tests/Unit/ReportApi/Online/OnlineTrendControllerTest.php` | Verifies online trend API envelope, aggregation, and path traversal handling. |
| `tests/Unit/Online/OnlineTrendRepositoryTest.php` | Tests per-day pass-rate aggregation and limiting behavior. |
| `tests/Unit/Online/OnlineScoreModelTest.php` | Validates model table name, casts, persistence, and dataset filtering helper. |
| `tests/Unit/Online/OnlineSamplingDecisionTest.php` | Tests config-driven sampling gate behavior with deterministic randomizer. |
| `tests/Unit/Online/OnlineMonitorTest.php` | Ensures capture dispatch behavior respects enablement and sampling rate. |
| `tests/Unit/Online/OnlineDriftAlertTest.php` | Tests drift detection logic and event dispatch thresholds. |
| `tests/Unit/Online/JudgeLiveSampleJobTest.php` | Validates queued job persists scores and pass/fail thresholding without provider calls. |
| `tests/Unit/Metrics/RetrievalRecallAtKMetricTest.php` | Covers recall@k scoring behavior. |
| `tests/Unit/Metrics/RetrievalNdcgAtKMetricTest.php` | Covers nDCG@k scoring for binary and graded relevance. |
| `tests/Unit/Metrics/RetrievalMrrMetricTest.php` | Covers MRR scoring behavior. |
| `tests/Unit/Metrics/RetrievalHitAtKMetricTest.php` | Covers hit@k scoring plus metadata k override behavior. |
| `tests/Unit/Metrics/RetrievalAliasResolutionTest.php` | Ensures retrieval metric aliases resolve via container without extra bindings. |
| `tests/Unit/Metrics/Retrieval/RankedRetrievalTest.php` | Tests retrieval JSON parsing, deduping, and expected-output gain handling. |
| `tests/Unit/Metrics/OrdinalDistanceMetricTest.php` | Covers ordinal-distance scoring and metadata scale override. |
| `tests/Unit/Metrics/AnswerContainmentAtKMetricTest.php` | Covers answer containment scoring over top-k retrieved texts. |
| `tests/Unit/Console/CalibrateJudgeCommandTest.php` | Ensures calibrate-judge command gates correctly and emits JSON output. |
| `tests/Unit/Calibration/JudgeCalibratorTest.php` | Validates agreement, confusion matrix, length bias correlation, and self-preference guard. |
| `tests/Unit/Calibration/CalibrationCaseLoaderTest.php` | Tests strict-schema calibration YAML validation and error cases. |
| `tests/Live/LiveOnlineMonitorTest.php` | Adds opt-in live test for end-to-end online capture against a real judge. |
| `tests/Fixtures/calibration/judge-cases.v1.yaml` | Adds calibration fixture YAML used by command/unit tests. |
| `src/Support/RuntimeOptions.php` | Adds normalizeUnitInterval() for safe [0,1] fraction parsing/clamping. |
| `src/ReportApi/ReportApiSchema.php` | Adds schema discriminator for the online trend endpoint. |
| `src/ReportApi/Online/OnlineTrendResource.php` | Adds JSON resource shaping for online trend responses. |
| `src/ReportApi/Online/OnlineTrendController.php` | Implements read-only online trend endpoint, limit parsing, and dataset validation. |
| `src/Online/OnlineTrendRepository.php` | Adds SQL aggregation for per-day online pass-rate points. |
| `src/Online/OnlineScore.php` | Introduces persisted online score model and typed dataset query starter. |
| `src/Online/OnlineSamplingDecision.php` | Adds config-driven sampling gate with injectable randomizer. |
| `src/Online/OnlineMonitor.php` | Adds host-app entrypoint to capture and dispatch online judging jobs. |
| `src/Online/OnlineDriftAlert.php` | Adds drift detection over a recent window and dispatches alert event. |
| `src/Online/JudgeLiveSampleJob.php` | Adds queued job to evaluate a sampled interaction, persist score, and re-check drift. |
| `src/Online/Events/OnlinePassRateDropped.php` | Adds drift event contract for host-app alert routing. |
| `src/Metrics/RetrievalRecallAtKMetric.php` | Implements recall@k metric. |
| `src/Metrics/RetrievalNdcgAtKMetric.php` | Implements nDCG@k metric. |
| `src/Metrics/RetrievalMrrMetric.php` | Implements MRR metric. |
| `src/Metrics/RetrievalHitAtKMetric.php` | Implements hit@k metric. |
| `src/Metrics/Retrieval/RankedRetrieval.php` | Adds shared retrieval parser/value object for ranked ids/texts. |
| `src/Metrics/Retrieval/AbstractRetrievalRankingMetric.php` | Adds shared base resolving k and parsing ranked + gains for retrieval metrics. |
| `src/Metrics/OrdinalDistanceMetric.php` | Implements ordinal-distance metric with per-sample scale override. |
| `src/Metrics/MetricResolver.php` | Extends built-in alias map to include new metrics. |
| `src/Metrics/Metric.php` | Updates metric interface doc to list new built-ins. |
| `src/Metrics/AnswerContainmentAtKMetric.php` | Adds answer containment@k metric over retrieved texts. |
| `src/EvalHarnessServiceProvider.php` | Registers calibration/online services, loads migrations, publishes migrations, and wires command. |
| `src/Console/CalibrateJudgeCommand.php` | Adds calibrate-judge command and report output handling. |
| `src/Calibration/JudgeCalibrator.php` | Implements calibration run logic, strict JSON decoding, and Spearman correlation. |
| `src/Calibration/JudgeCalibrationReport.php` | Adds immutable calibration report value object + array export. |
| `src/Calibration/HumanLabel.php` | Adds value object for human-labelled calibration cases. |
| `src/Calibration/CalibrationCaseLoader.php` | Adds strict-schema YAML loader for calibration cases. |
| `routes/eval-harness-api.php` | Adds the new online trend endpoint route. |
| `README.md` | Documents new metrics, calibration command, online monitoring, config blocks, and API endpoint. |
| `docs/REPORT_API_CONTRACT.md` | Documents the online trend API contract and payload shape. |
| `docs/PROGRESS.md` | Records implementation progress and local gate results for v1.3.0 work. |
| `docs/LESSON.md` | Captures implementation lessons around aggregates, PHPStan constraints, and migrations. |
| `database/migrations/2026_06_16_000000_create_eval_harness_online_scores_table.php` | Adds first package migration for online score persistence. |
| `config/eval-harness.php` | Adds retrieval, calibration, and online configuration blocks. |
| `CHANGELOG.md` | Adds v1.3.0 release notes entry. |

Comment thread src/Online/OnlineTrendRepository.php
Comment thread database/migrations/2026_06_16_000000_create_eval_harness_online_scores_table.php Outdated
Comment on lines +49 to +56
private function random(): float
{
    if ($this->randomizer !== null) {
        return ($this->randomizer)();
    }

    return mt_rand() / mt_getrandmax();
}
Comment thread docs/REPORT_API_CONTRACT.md Outdated
Comment on lines +140 to +143
- Aggregates `eval_harness_online_scores` rows into per-day
(UTC date) pass-rate points, ascending by date. `threshold` echoes
`eval-harness.online.alert.threshold` so a dashboard can draw the
alert band.
- Declare illuminate/database (Eloquent OnlineScore model + migration).
- Portable boolean aggregate: SUM(CASE WHEN passed THEN 1 ELSE 0 END) so
  the online trend works on PostgreSQL, not just MySQL/SQLite.
- Drop redundant single-column dataset/judged_at indexes; the composite
  (dataset, judged_at) index covers every hot query.
- Sampling draw stays in [0, 1): divide by (mt_getrandmax() + 1).
- calibrate-judge: keep --json stdout parseable by suppressing the human
  diagnostic lines when JSON is written to stdout (data is in the payload).
- Soften the online-trend contract wording to not promise UTC bucketing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
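
The sampling-draw change in the commit message above (dividing by mt_getrandmax() + 1 so the draw stays in [0, 1)) can be sketched as follows. `unitDraw` and `shouldSample` are hypothetical names, and the strict `<` comparison against the rate is an assumption about how a sampling gate would use the draw.

```php
<?php
declare(strict_types=1);

/** Uniform draw in [0, 1): the upper bound 1.0 can never be returned. */
function unitDraw(): float
{
    return mt_rand() / (mt_getrandmax() + 1);
}

/**
 * Hypothetical sampling gate: samples when the draw falls below $rate.
 * With a half-open draw, rate 1.0 always samples and rate 0.0 never does.
 *
 * @param (callable(): float)|null $randomizer injectable for deterministic tests
 */
function shouldSample(float $rate, ?callable $randomizer = null): bool
{
    $draw = $randomizer !== null ? $randomizer() : unitDraw();

    return $draw < $rate;
}
```

Dividing by mt_getrandmax() alone allows a draw of exactly 1.0, which under a `<` comparison would skip a sample even at a configured rate of 1.0.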

Copilot AI left a comment

Pull request overview

Copilot reviewed 57 out of 57 changed files in this pull request and generated 2 comments.

Comment on lines +24 to +28
// Single composite index covers every hot query
// (forDataset()->orderByDesc('judged_at') and the grouped
// trend aggregate); standalone dataset/judged_at indexes
// would be redundant write amplification.
$table->index(['dataset', 'judged_at']);
Comment thread CHANGELOG.md Outdated
Comment on lines +8 to +12
## [1.3.0] - 2026-06-16

Additive, backward-compatible feature set. All v1 contracts are preserved.

### Added
- Add id to the online-scores composite index ((dataset, judged_at, id))
  so the drift-alert ORDER BY judged_at DESC, id DESC tie-break is covered.
- Move the 1.3.0 CHANGELOG entry below [Unreleased] per Keep a Changelog;
  add the 1.3.0 link reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 57 out of 57 changed files in this pull request and generated 1 comment.

Comment on lines +52 to +76
    $needle = $this->normalize($sample->expectedOutput);

    $matchedRank = null;
    foreach ($ranked->topKTexts($k) as $index => $text) {
        $haystack = $this->normalize($text);
        if ($haystack !== '' && str_contains($haystack, $needle)) {
            $matchedRank = $index + 1;
            break;
        }
    }

    return new MetricScore($matchedRank !== null ? 1.0 : 0.0, [
        'k' => $k,
        'expected_span' => $sample->expectedOutput,
        'matched_rank' => $matchedRank,
    ]);
}

private function normalize(string $value): string
{
    $collapsed = preg_replace('/\s+/u', ' ', $value) ?? $value;
    $collapsed = trim($collapsed);

    return $this->caseSensitive ? $collapsed : mb_strtolower($collapsed, 'UTF-8');
}
@lopadova lopadova merged commit 5332e17 into main Jun 16, 2026
7 checks passed