feat: LLM-as-judge scorer, dataset auto-sampling, eval --ci baseline#78

Merged
Siddhant-K-code merged 1 commit into main from feat/eval-llm-judge-dataset-ci on May 17, 2026

@Siddhant-K-code
Owner

Closes #65, #66, #69

Changes

LLM-as-judge scorer (score_llm_judge)

  • Calls any OpenAI-compatible endpoint (base_url + api_key + model)
  • Sends a user-supplied prompt with the session event summary
  • Parses {"score": float, "reason": str} from the response, strips markdown fences, clamps score to [0, 1]
  • Dispatched via run_scorer("llm_judge", {...}, events) alongside existing scorers
  • Graceful failure (score=0, passed=False) on missing credentials, HTTP errors, or malformed JSON
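
The response handling described above (fence stripping, JSON parsing, clamping, graceful failure) could look roughly like the sketch below. This is not the code from this PR: `parse_judge_response`, the `threshold` parameter, and its 0.5 default are hypothetical names and values chosen for illustration, and the HTTP call to the endpoint is left out entirely.

```python
import json
import math
import re

FENCE = "`" * 3  # markdown code-fence marker, built indirectly to keep this snippet fence-safe


def parse_judge_response(text: str, threshold: float = 0.5) -> dict:
    """Turn a raw judge reply into a score dict (hypothetical helper, not the PR's code)."""
    cleaned = text.strip()
    # Unwrap a reply that the model wrapped in a markdown fence, with or without a language tag.
    fenced = re.match(r"^" + FENCE + r"[\w-]*\s*(.*?)\s*" + FENCE + r"$", cleaned, re.DOTALL)
    if fenced:
        cleaned = fenced.group(1)
    try:
        data = json.loads(cleaned)
        score = float(data["score"])
        if math.isnan(score):
            raise ValueError("score is NaN")
    except (ValueError, KeyError, TypeError):
        # Graceful failure, mirroring the behaviour described above.
        return {"score": 0.0, "reason": "malformed judge response", "passed": False}
    score = max(0.0, min(1.0, score))  # clamp to [0, 1]
    # The pass threshold is an assumption; the PR does not state how `passed` is derived.
    return {"score": score, "reason": str(data.get("reason", "")), "passed": score >= threshold}
```

Returning a zero score instead of raising keeps one bad judge reply from aborting a whole eval run, which appears to be the intent behind the graceful-failure bullet.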

Dataset auto-sampling (eval dataset auto)

  • auto_populate(store, path, filter, since_days, label) scans recent sessions and adds matching ones to a .jsonl dataset
  • Six built-in filters: has-errors, high-retry, cost-above:<usd>, wide-blast, long-duration:<Ns>, low-eval-score:<threshold>
  • Deduplicates against existing dataset entries; returns count of newly added sessions
  • CLI: agent-strace eval dataset auto --filter has-errors --since-days 7
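
As an illustration of the filter syntax and the deduplication step, here is a minimal sketch. It makes several assumptions: the helper names `parse_filter` and `append_new` are hypothetical, each dataset line is assumed to be a JSON object with a `session_id` key, and `long-duration` is assumed to accept a trailing `s` (as in `30s`). The real `auto_populate` reads sessions from a store, which is omitted here.

```python
import json
from pathlib import Path
from typing import List, Optional, Tuple

# Filter names taken from the list above; the split into two groups is an assumption.
PLAIN = {"has-errors", "high-retry", "wide-blast"}
PARAMETRIC = {"cost-above", "long-duration", "low-eval-score"}


def parse_filter(spec: str) -> Tuple[str, Optional[float]]:
    """Split a spec such as 'cost-above:0.5' into (name, numeric argument)."""
    name, sep, arg = spec.partition(":")
    if name in PLAIN and not sep:
        return name, None
    if name in PARAMETRIC and sep:
        # Assumed: 'long-duration:30s' carries a seconds suffix that is dropped here.
        if name == "long-duration" and arg.endswith("s"):
            arg = arg[:-1]
        return name, float(arg)
    raise ValueError(f"unrecognised filter: {spec!r}")


def append_new(path: Path, session_ids: List[str], label: Optional[str] = None) -> int:
    """Append unseen session ids to a .jsonl file and return how many were added."""
    seen = set()
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                seen.add(json.loads(line)["session_id"])  # assumed record shape
    added = 0
    with path.open("a") as fh:
        for sid in session_ids:
            if sid in seen:
                continue
            record = {"session_id": sid}
            if label is not None:
                record["label"] = label
            fh.write(json.dumps(record) + "\n")
            seen.add(sid)
            added += 1
    return added
```

Under these assumptions, running the same sampling twice leaves the dataset unchanged on the second pass and returns 0, which matches the "count of newly added sessions" behaviour described above.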

eval --ci baseline comparison

  • --save-baseline <path>: saves current scorer scores as a JSON baseline
  • --baseline <path>: loads baseline and checks for regressions
  • --tolerance <float>: allowed score drop before flagging regression (default 0.0)
  • --github-summary: writes .agent-traces/eval-summary.md with a PR-comment-ready Markdown table showing score, baseline, delta, and pass/fail per scorer
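
The regression check and the summary table can be sketched as follows. This assumes the baseline is a flat JSON mapping of scorer name to score, which the PR description does not confirm; `find_regressions` and `render_summary` are hypothetical names, not the `_load_baseline` / `_write_github_summary` helpers added in this PR.

```python
from typing import Dict


def find_regressions(current: Dict[str, float], baseline: Dict[str, float],
                     tolerance: float = 0.0) -> Dict[str, float]:
    """Return {scorer: delta} for scorers that dropped by more than `tolerance`."""
    regressions = {}
    for name, base in baseline.items():
        if name not in current:
            continue  # scorer no longer runs; not treated as a regression in this sketch
        delta = current[name] - base
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


def render_summary(current: Dict[str, float], baseline: Dict[str, float],
                   tolerance: float = 0.0) -> str:
    """Build a Markdown table with score, baseline, delta, and pass/fail per scorer."""
    failed = find_regressions(current, baseline, tolerance)
    lines = ["| Scorer | Score | Baseline | Delta | Status |",
             "|---|---|---|---|---|"]
    for name in sorted(current):
        score = current[name]
        if name in baseline:
            base_cell = f"{baseline[name]:.2f}"
            delta_cell = f"{score - baseline[name]:+.2f}"
        else:
            base_cell = delta_cell = "n/a"  # new scorer with no baseline entry
        status = "fail" if name in failed else "pass"
        lines.append(f"| {name} | {score:.2f} | {base_cell} | {delta_cell} | {status} |")
    return "\n".join(lines)
```

With the default tolerance of 0.0, any drop below the baseline counts as a regression; a small positive tolerance can absorb run-to-run noise from a non-deterministic scorer such as an LLM judge.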

Tests

25 new tests in tests/test_eval_extensions.py covering all three features. Full suite: 700 tests, all passing.

…65 #66 #69)

- Add score_llm_judge() scorer: calls any OpenAI-compatible endpoint,
  parses JSON {score, reason}, clamps to [0,1], strips markdown fences
- Dispatch llm_judge through run_scorer() alongside existing scorers
- Add auto_populate() to dataset: 6 signal filters (has-errors,
  high-retry, cost-above, wide-blast, long-duration, low-eval-score),
  since_days window, dedup, optional label
- Extend cmd_eval_ci() with --baseline, --save-baseline, --tolerance,
  --github-summary flags; _load_baseline/_save_baseline/_write_github_summary
- GitHub summary writes PR-comment-ready Markdown with delta vs baseline
- 25 new tests covering all three features (700 total, all passing)

Co-authored-by: Ona <no-reply@ona.com>
@Siddhant-K-code Siddhant-K-code merged commit f404833 into main May 17, 2026
4 checks passed
@Siddhant-K-code Siddhant-K-code deleted the feat/eval-llm-judge-dataset-ci branch May 17, 2026 12:51


Development

Successfully merging this pull request may close these issues.

feat: LLM-as-judge scoring on captured traces (agent-strace eval)
