rag-eval-pack

Production-grade RAG evaluation toolkit with LLM-as-judge, cost accounting, and CI/CD regression gates.

This monorepo provides a composable suite of packages for evaluating Retrieval-Augmented Generation (RAG) systems across four core metrics — faithfulness, relevance, context precision, and context recall — with heuristic scoring, LLM-based judging, budget enforcement, and automated quality gating for CI pipelines.

Features

Four evaluation metrics — faithfulness, relevance, context precision, and context recall with heuristic scorers
LLM-as-judge — multi-provider judging (Anthropic, OpenAI, Google) with calibration and consensus voting
Cost accounting — per-sample and per-run token tracking with budget enforcement and alert thresholds
Quality gates — threshold and baseline-comparison gates with formatted CI output and exit codes
MCP server — three-layer tool API (judge.*, suite.*, gate.*) for agent-driven evaluation
Dataset management — multi-format loading, Zod validation, synthetic generation, and version tracking
Observability — structured Pino logging, OpenTelemetry tracing, and Prometheus-compatible metrics
Dual ESM/CJS — every package ships cjs and esm output for maximum compatibility
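
As a rough illustration of the cost-accounting idea above, the following sketch tracks per-sample token usage against a run budget and reports an alert once a fraction of the budget is spent. It is not the API of `@reaatech/rag-eval-cost`: the `BudgetTracker` class, the `alertFraction` parameter, and the per-million-token prices are all assumptions made for this example.

```typescript
// Hypothetical sketch: none of these names come from @reaatech/rag-eval-cost.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Pricing {
  // Assumed prices in USD per million tokens; real prices vary by provider and model.
  inputPerMillion: number;
  outputPerMillion: number;
}

type BudgetStatus = "ok" | "alert" | "exceeded";

class BudgetTracker {
  private spent = 0;
  private readonly budgetLimit: number;
  private readonly pricing: Pricing;
  private readonly alertFraction: number;

  constructor(budgetLimit: number, pricing: Pricing, alertFraction = 0.8) {
    this.budgetLimit = budgetLimit;
    this.pricing = pricing;
    this.alertFraction = alertFraction;
  }

  // Record one sample's usage and return that sample's cost in USD.
  record(usage: TokenUsage): number {
    const cost =
      (usage.inputTokens / 1_000_000) * this.pricing.inputPerMillion +
      (usage.outputTokens / 1_000_000) * this.pricing.outputPerMillion;
    this.spent += cost;
    return cost;
  }

  // Total spend for the run so far, in USD.
  get total(): number {
    return this.spent;
  }

  // "exceeded" once spend passes the limit, "alert" once it reaches the alert fraction.
  status(): BudgetStatus {
    if (this.spent > this.budgetLimit) return "exceeded";
    if (this.spent >= this.budgetLimit * this.alertFraction) return "alert";
    return "ok";
  }
}

const tracker = new BudgetTracker(10.0, { inputPerMillion: 3, outputPerMillion: 15 });
tracker.record({ inputTokens: 2_000_000, outputTokens: 0 }); // 6.00 USD
tracker.record({ inputTokens: 1_000_000, outputTokens: 0 }); // 3.00 USD
console.log(tracker.total, tracker.status()); // 9 "alert"
```

A runner built this way could stop issuing judge calls as soon as `status()` returns `"exceeded"`, which is one plausible reading of "budget enforcement".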

Installation

Using the packages

Packages are published under the @reaatech scope and can be installed individually:

# Core types and schemas
pnpm add @reaatech/rag-eval-core

# Metric scorers
pnpm add @reaatech/rag-eval-metrics

# LLM judge
pnpm add @reaatech/rag-eval-judge

# Cost tracking
pnpm add @reaatech/rag-eval-cost

# Quality gates
pnpm add @reaatech/rag-eval-gate

# Dataset management
pnpm add @reaatech/rag-eval-dataset

# Central orchestrator
pnpm add @reaatech/rag-eval-suite

# MCP server
pnpm add @reaatech/rag-eval-mcp-server @modelcontextprotocol/sdk

# CLI tool
pnpm add @reaatech/rag-eval-cli

# Observability utilities
pnpm add @reaatech/rag-eval-observability

Contributing

# Clone the repository
git clone https://github.com/reaatech/rag-eval-pack.git
cd rag-eval-pack

# Install dependencies
pnpm install

# Build all packages
pnpm build

# Run the test suite
pnpm test

# Run linting
pnpm lint

Quick Start

Evaluate a RAG system's output in a few lines:

import { EvaluationSuite } from "@reaatech/rag-eval-suite";

const suite = new EvaluationSuite({
  metrics: ["faithfulness", "relevance", "context_precision", "context_recall"],
  judge: { model: "claude-opus" },
  gates: [
    { name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 },
  ],
  cost: { budget_limit: 10.00 },
});

const result = await suite.runFromFile("datasets/eval-samples.jsonl");

console.log("Overall score:", result.results.metrics.overall_score);
console.log("Faithfulness:", result.results.metrics.avg_faithfulness);
console.log("Total cost:", result.results.total_cost);
console.log("Gates passed:", result.gate_result?.passed);

Or use the CLI:

rag-eval-pack evaluate --dataset dataset.jsonl --output results.json
rag-eval-pack gate --results results.json --gates gates.yaml
rag-eval-pack report --results results.json --output report.md

See datasets/examples/ for sample datasets and configuration files.
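
To make the threshold gate in the Quick Start configuration more concrete, here is a minimal sketch of how such a gate could be checked against aggregated metrics. The gate fields (`name`, `type`, `metric`, `operator`, `threshold`) mirror the configuration shown above; the `evaluateGates` and `compare` functions, the set of supported operators, and the rule that a missing metric fails the gate are assumptions for illustration, not the behavior of `@reaatech/rag-eval-gate`.

```typescript
// Gate shape mirrors the Quick Start config; the evaluation logic is an assumption.
type Operator = ">=" | ">" | "<=" | "<" | "==";

interface ThresholdGate {
  name: string;
  type: "threshold";
  metric: string;
  operator: Operator;
  threshold: number;
}

interface GateOutcome {
  name: string;
  passed: boolean;
  actual: number | undefined;
}

// Apply a single comparison operator to the observed value.
function compare(actual: number, operator: Operator, threshold: number): boolean {
  switch (operator) {
    case ">=": return actual >= threshold;
    case ">": return actual > threshold;
    case "<=": return actual <= threshold;
    case "<": return actual < threshold;
    case "==": return actual === threshold;
  }
}

// Evaluate every gate; the run passes only if all gates pass.
function evaluateGates(
  gates: ThresholdGate[],
  metrics: Record<string, number>,
): { passed: boolean; outcomes: GateOutcome[] } {
  const outcomes = gates.map((gate): GateOutcome => {
    const actual: number | undefined = metrics[gate.metric];
    // Assumption: a missing metric fails the gate rather than silently passing.
    const passed = actual !== undefined && compare(actual, gate.operator, gate.threshold);
    return { name: gate.name, passed, actual };
  });
  return { passed: outcomes.every((o) => o.passed), outcomes };
}

const gateResult = evaluateGates(
  [{ name: "min-faithfulness", type: "threshold", metric: "avg_faithfulness", operator: ">=", threshold: 0.85 }],
  { avg_faithfulness: 0.91 },
);
console.log(gateResult.passed); // true
```

In a CI job, a failing result would typically be mapped to a non-zero exit code so that the pipeline step fails.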

Packages

| Package | Description |
| --- | --- |
| `@reaatech/rag-eval-core` | Canonical types, Zod schemas, and domain models |
| `@reaatech/rag-eval-metrics` | Heuristic metric scorers (faithfulness, relevance, precision, recall) |
| `@reaatech/rag-eval-judge` | LLM-as-judge with calibration, consensus, and cost tracking |
| `@reaatech/rag-eval-cost` | Pricing, budgeting, and cost reporting |
| `@reaatech/rag-eval-gate` | Quality gates and CI regression checks |
| `@reaatech/rag-eval-dataset` | Dataset loading, validation, generation, and versioning |
| `@reaatech/rag-eval-suite` | Central orchestration engine |
| `@reaatech/rag-eval-mcp-server` | MCP server for agent-driven evaluation |
| `@reaatech/rag-eval-cli` | CLI entry point and commands |
| `@reaatech/rag-eval-observability` | Structured logging, tracing, and metrics |
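
For readers unfamiliar with the retrieval metrics named above, the sketch below shows one common, simplified way to define context precision and context recall over chunk identifiers. It is only an orientation aid: the function names and the set-based definitions are assumptions, and the heuristic scorers in `@reaatech/rag-eval-metrics` may compute these metrics differently.

```typescript
// Simplified set-based definitions; the scorers in @reaatech/rag-eval-metrics may differ.
// Both functions assume chunk ids are unique within each list.

// Fraction of retrieved chunks that are actually relevant.
function contextPrecision(retrievedIds: string[], relevantIds: string[]): number {
  if (retrievedIds.length === 0) return 0;
  const relevant = new Set(relevantIds);
  const hits = retrievedIds.filter((id) => relevant.has(id)).length;
  return hits / retrievedIds.length;
}

// Fraction of relevant chunks that were retrieved.
function contextRecall(retrievedIds: string[], relevantIds: string[]): number {
  if (relevantIds.length === 0) return 0;
  const retrieved = new Set(retrievedIds);
  const hits = relevantIds.filter((id) => retrieved.has(id)).length;
  return hits / relevantIds.length;
}

const retrievedChunks = ["a", "b", "c", "d"];
const relevantChunks = ["a", "c", "e"];
console.log(contextPrecision(retrievedChunks, relevantChunks)); // 0.5 (2 of 4 retrieved are relevant)
console.log(contextRecall(retrievedChunks, relevantChunks)); // 2 of 3 relevant chunks were retrieved
```

Precision penalizes retrieving irrelevant chunks, while recall penalizes missing relevant ones, which is why the two are usually reported together.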

Documentation

ARCHITECTURE.md — System design, package relationships, and data flows
AGENTS.md — Coding conventions, tool architecture, and development guidelines
CONTRIBUTING.md — Contribution workflow and release process
DEV_PLAN.md — Development checklist and roadmap

License

MIT
