Production-grade RAG evaluation toolkit with LLM-as-judge, cost accounting, and CI/CD regression gates.
This monorepo provides a composable suite of packages for evaluating Retrieval-Augmented Generation (RAG) systems across four core metrics — faithfulness, relevance, context precision, and context recall — with heuristic scoring, LLM-based judging, budget enforcement, and automated quality gating for CI pipelines.
- Four evaluation metrics: faithfulness, relevance, context precision, and context recall, with heuristic scorers
- LLM-as-judge: multi-provider judging (Anthropic, OpenAI, Google) with calibration and consensus voting
- Cost accounting: per-sample and per-run token tracking with budget enforcement and alert thresholds
- Quality gates: threshold and baseline-comparison gates with formatted CI output and exit codes
- MCP server: three-layer tool API (`judge.*`, `suite.*`, `gate.*`) for agent-driven evaluation
- Dataset management: multi-format loading, Zod validation, synthetic generation, and version tracking
- Observability: structured Pino logging, OpenTelemetry tracing, and Prometheus-compatible metrics
- Dual ESM/CJS: every package ships `cjs` and `esm` output for maximum compatibility
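As a rough illustration of the cost-accounting feature listed above, per-sample token usage can be priced and accumulated against a run budget. The sketch below is not the `@reaatech/rag-eval-cost` API: `BudgetTracker`, `TokenUsage`, `Pricing`, the default 80% alert fraction, and the per-million-token price fields are hypothetical names chosen for the example.

```typescript
// Hypothetical sketch of budget enforcement; not the package's real API.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Pricing {
  // USD per one million tokens (illustrative unit, an assumption).
  inputPerMillion: number;
  outputPerMillion: number;
}

type BudgetStatus = "ok" | "alert" | "exceeded";

class BudgetTracker {
  private spent = 0;
  private readonly budgetLimit: number;
  private readonly alertFraction: number;

  constructor(budgetLimit: number, alertFraction = 0.8) {
    this.budgetLimit = budgetLimit;
    this.alertFraction = alertFraction;
  }

  /** Price one sample's token usage, add it to the run total, and return its cost. */
  record(usage: TokenUsage, pricing: Pricing): number {
    const cost =
      (usage.inputTokens / 1_000_000) * pricing.inputPerMillion +
      (usage.outputTokens / 1_000_000) * pricing.outputPerMillion;
    this.spent += cost;
    return cost;
  }

  /** Total cost of the run so far, in USD. */
  get totalCost(): number {
    return this.spent;
  }

  /** "exceeded" above the limit, "alert" at or above the alert fraction, else "ok". */
  status(): BudgetStatus {
    if (this.spent > this.budgetLimit) return "exceeded";
    if (this.spent >= this.budgetLimit * this.alertFraction) return "alert";
    return "ok";
  }
}
```

A runner built this way would call `record` after every judged sample and stop scheduling new samples once `status()` returns `"exceeded"`.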
Packages are published under the @reaatech scope and can be installed individually:
```bash
# Core types and schemas
pnpm add @reaatech/rag-eval-core

# Metric scorers
pnpm add @reaatech/rag-eval-metrics

# LLM judge
pnpm add @reaatech/rag-eval-judge

# Cost tracking
pnpm add @reaatech/rag-eval-cost

# Quality gates
pnpm add @reaatech/rag-eval-gate

# Dataset management
pnpm add @reaatech/rag-eval-dataset

# Central orchestrator
pnpm add @reaatech/rag-eval-suite

# MCP server
pnpm add @reaatech/rag-eval-mcp-server @modelcontextprotocol/sdk

# CLI tool
pnpm add @reaatech/rag-eval-cli

# Observability utilities
pnpm add @reaatech/rag-eval-observability
```

To build and test from source:

```bash
# Clone the repository
git clone https://github.com/reaatech/rag-eval-pack.git
cd rag-eval-pack

# Install dependencies
pnpm install

# Build all packages
pnpm build

# Run the test suite
pnpm test

# Run linting
pnpm lint
```

Evaluate a RAG system's output in a few lines:
```typescript
import { EvaluationSuite } from "@reaatech/rag-eval-suite";

const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance", "context_precision", "context_recall"],
  judge: { model: "claude-opus" },
  gates: [
    { name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
  ],
  cost: { budget_limit: 10.00 },
});

const result = await suite.runFromFile("datasets/eval-samples.jsonl");

console.log("Overall score:", result.results.metrics.overall_score);
console.log("Faithfulness:", result.results.metrics.avg_faithfulness);
console.log("Total cost:", result.results.total_cost);
console.log("Gates passed:", result.gate_result?.passed);
```

Or use the CLI:
```bash
rag-eval-pack evaluate --dataset dataset.jsonl --output results.json
rag-eval-pack gate --results results.json --gates gates.yaml
rag-eval-pack report --results results.json --output report.md
```

See `datasets/examples/` for sample datasets and configuration files.
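The gate entry in the configuration above (`name`, `type`, `metric`, `operator`, `threshold`) suggests how a threshold check might be evaluated and turned into a CI exit code. The following is a minimal sketch under that assumption, not the `@reaatech/rag-eval-gate` implementation: the `baseline` gate type, its `maxRegression` field, `evaluateGates`, and `exitCodeFor` are hypothetical.

```typescript
// Threshold gate shape copied from the quick-start config; everything else is assumed.
type Operator = ">=" | ">" | "<=" | "<" | "==";

interface ThresholdGate {
  name: string;
  type: "threshold";
  metric: string;
  operator: Operator;
  threshold: number;
}

interface BaselineGate {
  name: string;
  type: "baseline"; // hypothetical type name for a baseline-comparison gate
  metric: string;
  maxRegression: number; // hypothetical: largest allowed drop versus the baseline
}

type Gate = ThresholdGate | BaselineGate;
type Metrics = Record<string, number>;

interface GateOutcome {
  name: string;
  passed: boolean;
  detail: string;
}

interface GateReport {
  passed: boolean;
  outcomes: GateOutcome[];
}

const comparators: Record<Operator, (value: number, limit: number) => boolean> = {
  ">=": (value, limit) => value >= limit,
  ">": (value, limit) => value > limit,
  "<=": (value, limit) => value <= limit,
  "<": (value, limit) => value < limit,
  "==": (value, limit) => value === limit,
};

/** Evaluate every gate; the report passes only if all gates pass. */
function evaluateGates(gates: Gate[], current: Metrics, baseline: Metrics = {}): GateReport {
  const outcomes = gates.map((gate): GateOutcome => {
    const value: number | undefined = current[gate.metric];
    if (value === undefined) {
      return { name: gate.name, passed: false, detail: `metric "${gate.metric}" is missing` };
    }
    if (gate.type === "threshold") {
      const passed = comparators[gate.operator](value, gate.threshold);
      return { name: gate.name, passed, detail: `${value} ${gate.operator} ${gate.threshold}` };
    }
    const reference: number | undefined = baseline[gate.metric];
    if (reference === undefined) {
      return { name: gate.name, passed: false, detail: `no baseline for "${gate.metric}"` };
    }
    const drop = reference - value;
    return {
      name: gate.name,
      passed: drop <= gate.maxRegression,
      detail: `dropped ${drop.toFixed(3)} versus baseline`,
    };
  });
  return { passed: outcomes.every((outcome) => outcome.passed), outcomes };
}

/** Conventional CI mapping: 0 when all gates pass, 1 otherwise. */
function exitCodeFor(passed: boolean): number {
  return passed ? 0 : 1;
}
```

A missing metric or missing baseline is treated as a failure here so that a misconfigured pipeline fails loudly rather than passing silently; a CI script would hand the result of `exitCodeFor` to `process.exit`.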
| Package | Description |
|---|---|
| `@reaatech/rag-eval-core` | Canonical types, Zod schemas, and domain models |
| `@reaatech/rag-eval-metrics` | Heuristic metric scorers (faithfulness, relevance, precision, recall) |
| `@reaatech/rag-eval-judge` | LLM-as-judge with calibration, consensus, and cost tracking |
| `@reaatech/rag-eval-cost` | Pricing, budgeting, and cost reporting |
| `@reaatech/rag-eval-gate` | Quality gates and CI regression checks |
| `@reaatech/rag-eval-dataset` | Dataset loading, validation, generation, and versioning |
| `@reaatech/rag-eval-suite` | Central orchestration engine |
| `@reaatech/rag-eval-mcp-server` | MCP server for agent-driven evaluation |
| `@reaatech/rag-eval-cli` | CLI entry point and commands |
| `@reaatech/rag-eval-observability` | Structured logging, tracing, and metrics |
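The judge package's consensus step could, in principle, combine scores from several providers into a single value. The sketch below shows one possible approach (median score plus an agreement ratio); it is an assumption for illustration, and `JudgeVote`, `consensus`, and the `tolerance` parameter are hypothetical names, not the `@reaatech/rag-eval-judge` API.

```typescript
// Hypothetical consensus sketch: median of judge scores plus an agreement ratio.
interface JudgeVote {
  provider: string; // e.g. "anthropic", "openai", "google"
  score: number; // assumed to lie in [0, 1]
}

interface Consensus {
  score: number; // median of all judge scores
  agreement: number; // fraction of judges within `tolerance` of the median
  votes: number; // number of judges consulted
}

/** Median of a non-empty list; does not mutate the input. */
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

/** Combine per-provider votes; throws if no votes are supplied. */
function consensus(votes: JudgeVote[], tolerance = 0.1): Consensus {
  if (votes.length === 0) {
    throw new Error("at least one judge vote is required");
  }
  const score = median(votes.map((vote) => vote.score));
  const agreeing = votes.filter((vote) => Math.abs(vote.score - score) <= tolerance).length;
  return { score, agreement: agreeing / votes.length, votes: votes.length };
}
```

The median is used instead of the mean so that one outlying judge cannot drag the combined score; a low `agreement` value could be used to flag samples for manual review.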
- `ARCHITECTURE.md`: System design, package relationships, and data flows
- `AGENTS.md`: Coding conventions, tool architecture, and development guidelines
- `CONTRIBUTING.md`: Contribution workflow and release process
- `DEV_PLAN.md`: Development checklist and roadmap