A lightweight evaluation framework for LLM systems combining:
- deterministic evaluation (critical gates + heuristics)
- probabilistic evaluation (LLM-as-judge)
- regression detection across runs
LLM systems are non-deterministic.
Traditional pass/fail testing is not enough.
This project explores a layered evaluation approach:
Dataset
↓
System Under Test Response
↓
Critical Gates (hard constraints)
↓
Heuristic Scoring (deterministic)
↓
(Optional) Judge Ensemble (LLM)
↓
Regression Comparison
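The layered flow above can be sketched as a small Python pipeline. This is illustrative only; the function names (`run_gates`, `score_heuristics`, `evaluate`) and the example gate/heuristic rules are assumptions, not the project's actual API:

```python
# Illustrative sketch of the layered evaluation flow.
# Function names and rules here are hypothetical, not the project's API.

def run_gates(response: str) -> bool:
    """Critical gates: hard constraints that short-circuit evaluation."""
    return bool(response.strip()) and len(response) < 4000

def score_heuristics(response: str) -> float:
    """Deterministic scoring, e.g. rewarding structured list output."""
    bullets = sum(1 for line in response.splitlines()
                  if line.lstrip().startswith("-"))
    return min(bullets / 3, 1.0)

def evaluate(response: str, judge=None) -> dict:
    """Gates first, then heuristics, then an optional LLM-judge layer."""
    if not run_gates(response):
        return {"passed": False, "score": 0.0}
    score = score_heuristics(response)
    if judge is not None:  # probabilistic layer is optional
        score = (score + judge(response)) / 2
    return {"passed": True, "score": round(score, 2)}

print(evaluate("- a\n- b\n- c"))  # {'passed': True, 'score': 1.0}
```

Each layer only runs if the previous one passes, so cheap deterministic checks filter out failures before any judge call is made.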
The project now supports multiple evaluation domains via self-contained example packs.

examples/wine_recommendation/

Demonstrates:
- recommendation-style evaluation
- structured list outputs
- qualitative scoring (taste, tone, diversity)
- baseline reference task

examples/retail_support/

Demonstrates:
- recommendation tasks
- support assistant evaluation
- retrieval-grounded responses (RAG-style)
- simple agent workflows (mock tools)
- structured output expectations
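Each pack is driven by a task config. The sketch below shows the kind of fields such a config might carry; every key and path here is a hypothetical illustration (the actual files are YAML, e.g. `configs/tasks/wine.yaml`, and the real schema is whatever `task_loader.py` expects):

```python
# Hypothetical shape of a task config, shown as a Python dict.
# The real config files are YAML and their schema may differ.
task_config = {
    "task_name": "wine_recommendation",
    "dataset": "examples/wine_recommendation/dataset.json",  # hypothetical path
    "critical_gates": ["non_empty", "valid_structure"],      # hard constraints
    "heuristics": {"min_list_items": 3},                     # deterministic checks
    "judge": {"enabled": False, "model": None},              # optional LLM layer
}

print(sorted(task_config))
```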
.
│
├── examples/
│   ├── wine_recommendation/
│   └── retail_support/
│
├── configs/
│   ├── tasks/
│   ├── systems/
│   └── judges/
│
├── results/<task_name>/
├── baselines/<task_name>/
│
├── runner.py
├── scorer.py
├── task_loader.py
├── tool_simulator.py
├── schemas.py
└── regression_compare.py
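The result objects passed between these modules could be modeled with dataclasses along these lines. This is a sketch of the kind of thing `schemas.py` might hold; the class and field names are assumptions:

```python
# Hypothetical result schemas; the project's actual schemas.py may differ.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CaseResult:
    case_id: str
    passed_gates: bool
    heuristic_score: float
    judge_score: Optional[float] = None  # set only when the judge layer ran

@dataclass
class RunResult:
    task_name: str
    cases: list = field(default_factory=list)

    def mean_score(self) -> float:
        """Average heuristic score over cases that passed the gates."""
        scores = [c.heuristic_score for c in self.cases if c.passed_gates]
        return sum(scores) / len(scores) if scores else 0.0

run = RunResult("wine_recommendation", [
    CaseResult("c1", True, 0.8),
    CaseResult("c2", False, 0.0),  # gate failures are excluded from the mean
])
print(run.mean_score())  # 0.8
```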
Run an evaluation in mock mode:

python3 runner.py --task-config configs/tasks/wine.yaml --mode mock
python3 runner.py --task-config examples/retail_support/task_config.yaml --mode mock

⸻
Write a baseline for later regression comparison:

python3 runner.py --task-config configs/tasks/wine.yaml --mode mock --write-baseline
python3 runner.py --task-config examples/retail_support/task_config.yaml --mode mock --write-baseline

⸻
Compare the latest results against a baseline.

Using explicit paths:

python3 regression_compare.py baselines/wine_recommendation/baseline_results.json results/wine_recommendation/latest_results.json

Using a task shortcut:

python3 regression_compare.py --task wine_recommendation
python3 regression_compare.py --task retail_support

⸻
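At its core, regression detection means comparing per-case scores between a baseline run and the latest run, with some tolerance for non-determinism. A minimal sketch (the real logic lives in `regression_compare.py`; the score layout and threshold here are assumptions):

```python
# Minimal sketch of regression detection between two runs.
# The real file format and thresholds are assumptions, not the project's.

def detect_regressions(baseline: dict, latest: dict,
                       tolerance: float = 0.05) -> list:
    """Flag case IDs whose score dropped by more than `tolerance`."""
    flagged = []
    for case_id, base_score in baseline.items():
        new_score = latest.get(case_id, 0.0)  # missing case counts as a drop
        if base_score - new_score > tolerance:
            flagged.append(case_id)
    return flagged

baseline = {"c1": 0.9, "c2": 0.7}
latest = {"c1": 0.88, "c2": 0.5}
print(detect_regressions(baseline, latest))  # ['c2']
```

The tolerance absorbs run-to-run noise from non-deterministic outputs, so only drops larger than expected variance are flagged.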
What this project explores:
• How to evaluate LLM outputs beyond simple correctness
• How to combine heuristics and LLM judges
• How to detect regressions in non-deterministic systems
• How to design evaluation datasets and rubrics
• How to structure reusable evaluation tasks
⸻
This is a V1 learning lab project.
Focus:
• clarity over completeness
• simplicity over abstraction
• experimentation over production design
⸻
Possible future directions:
• judge disagreement visualization
• evaluation analytics across runs
• cross-model judge comparison
• richer agent workflow evaluation
• dashboard / visualization layer