EvalLedger

EvalLedger is a registry for AI benchmark provenance. It exists to make benchmark versions citable, benchmark artifacts verifiable, and contamination checks inspectable instead of implicit.

Why it exists

The current benchmark ecosystem makes it hard to answer three basic questions:

  1. Which exact artifact was evaluated?
  2. Was the benchmark likely seen during pretraining?
  3. Can another researcher reproduce the result later?

EvalLedger stores benchmark metadata, version records, artifact hashes, and contamination reports in one public ledger so those questions have durable answers.

Quick start

docker compose up --build
# in a second terminal, once the containers are up:
cd backend
uv run alembic upgrade head
uv run python -m app.scripts.seed

Then open http://localhost:3000 for the web interface and http://localhost:8000/docs for the interactive API documentation.
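As a quick smoke check that the API came up, hit one of the read endpoints shown in the API overview below. A minimal sketch, assuming the default ports from the steps above:

# Smoke-check the running stack; uses only the Python standard library.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/stats/overview") as resp:
    print(json.dumps(json.load(resp), indent=2))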

Repository guide

  • Contributor workflow: CONTRIBUTING.md
  • Metadata standard: standard/METADATA_STANDARD.md
  • Machine-readable schema: standard/metadata_schema.json
  • Maintainer and operations runbooks: docs/README.md

Authentication

Sign in via GitHub or Google at /login. No password required.

CLI setup

# 1. Sign in at https://evalledger-frontend.vercel.app/login (GitHub or Google)
# 2. Create an API key at https://evalledger-frontend.vercel.app/account
# 3. Configure the CLI:
cd cli
uv sync
uv run evalledger login --api-key el_your_api_key_here

The key is validated against the API and stored in ~/.evalledger/config.toml.
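If you want to confirm the key landed on disk, the config is plain TOML. Its field names are not documented here, so this sketch only lists them rather than assuming any:

# Inspect the stored CLI config (tomllib requires Python 3.11+).
import tomllib
from pathlib import Path

config_path = Path.home() / ".evalledger" / "config.toml"
with config_path.open("rb") as f:
    config = tomllib.load(f)
print(sorted(config))  # top-level field names vary by CLI version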

CLI usage

uv run evalledger submit \
  --name "MMLU" --slug mmlu --version 2.0.0 \
  --file ./mmlu.jsonl --domain reasoning --task-type multiple_choice \
  --paper https://arxiv.org/abs/2009.03300 --license MIT
uv run evalledger verify mmlu 2.0.0
uv run evalledger search reasoning
uv run evalledger info mmlu 2.0.0

API overview

curl "http://localhost:8000/search?q=reasoning"
curl http://localhost:8000/benchmarks/mmlu
curl http://localhost:8000/stats/overview
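The same endpoints are straightforward to script. A sketch in Python, assuming only that the responses are JSON; the exact shape is best confirmed at /docs:

# Query the search endpoint and pretty-print the JSON response.
import json
import urllib.parse
import urllib.request

base = "http://localhost:8000"
query = urllib.parse.urlencode({"q": "reasoning"})
with urllib.request.urlopen(f"{base}/search?{query}") as resp:
    results = json.load(resp)
print(json.dumps(results, indent=2))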

Contamination methodology

EvalLedger uses MinHash sketches to approximate textual overlap between submitted artifacts and reference corpora. Each example is tokenized, hashed into a fixed-size sketch, queried against a corpus-local LSH index, and then rechecked with exact token-level Jaccard similarity before being flagged. The result is not a proof of contamination; it is a reproducible, inspectable overlap signal.
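A self-contained sketch of the same two-stage idea follows. The tokenizer, sketch size, and both thresholds are assumptions, and the corpus-local LSH index is replaced by a direct sketch comparison to keep the example short:

# Two-stage overlap check: cheap MinHash screen, then exact Jaccard recheck.
import hashlib

NUM_PERM = 128  # number of simulated hash permutations in the sketch

def tokens(text: str) -> set[str]:
    return set(text.lower().split())  # stand-in tokenizer

def minhash(toks: set[str]) -> list[int]:
    # Salted blake2b stands in for NUM_PERM independent hash functions.
    return [
        min(int.from_bytes(hashlib.blake2b(f"{i}:{t}".encode(), digest_size=8).digest(), "big")
            for t in toks)
        for i in range(NUM_PERM)
    ]

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    # Fraction of matching sketch slots approximates Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / NUM_PERM

def exact_jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

submitted = tokens("the quick brown fox jumps over the lazy dog")
reference = tokens("the quick brown fox jumped over a lazy dog")
if estimated_jaccard(minhash(submitted), minhash(reference)) > 0.5:  # screen
    flagged = exact_jaccard(submitted, reference) > 0.8              # recheck
    print("flagged" if flagged else "below exact-match threshold")

The screen stage is where the real pipeline consults the LSH index; the recheck stage is what actually decides whether an example is flagged.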

Metadata standard

The repository ships the human-readable standard in standard/METADATA_STANDARD.md and the machine schema in standard/metadata_schema.json.
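A draft record can be checked against the machine schema before submission. The field names below mirror the CLI flags from the usage section and are assumptions about the schema; run from the repository root:

# Validate a draft metadata record against the shipped JSON schema.
import json
from jsonschema import validate  # pip install jsonschema

with open("standard/metadata_schema.json") as f:
    schema = json.load(f)

record = {
    "name": "MMLU",
    "slug": "mmlu",
    "version": "2.0.0",
    "domain": "reasoning",
    "task_type": "multiple_choice",
    "paper": "https://arxiv.org/abs/2009.03300",
    "license": "MIT",
}
validate(instance=record, schema=schema)  # raises ValidationError on mismatch
print("record conforms to the schema")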

Contributing a benchmark

  1. Register the benchmark metadata through the CLI or web submit flow.
  2. Upload a versioned artifact with a semantic version (see the hashing sketch after this list).
  3. Review the generated contamination report before citing the record publicly.
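For step 2, it can help to fingerprint the artifact locally so you can compare it against the hash the ledger records. Whether EvalLedger uses SHA-256 specifically is an assumption in this sketch:

# Hash an artifact file in streaming chunks to avoid loading it whole.
import hashlib

def artifact_digest(path: str) -> str:
    h = hashlib.sha256()  # assumed digest; check the record for the real one
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(artifact_digest("./mmlu.jsonl"))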

Contributing code

  • make dev for local development
  • make migrate after schema changes
  • make seed to load reference data
  • make test for automated checks
  • make lint for static analysis

Operations

Operational runbooks live in docs/operations/ and maintainer process notes live in docs/maintainers/. They cover incident response, backup and restore drills, release flow, migration discipline, and performance verification.

Load testing

Use the built-in scenarios below to get a first-pass latency and throughput read on the API:

make loadtest API_URL=http://localhost:8000
make loadtest-account API_URL=http://localhost:8000 API_KEY=<user-api-key>
make loadtest-review API_URL=http://localhost:8000 API_KEY=<admin-api-key>

The harness reports success rate, throughput, p50/p95/p99 latency, and per-endpoint summaries for browse, account, and review flows. It is intended for controlled local or staging checks rather than internet-scale benchmarking. See docs/operations/performance.md for the full workflow and guidance on separating Render cold-start effects from real query regressions.
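To see why cold starts matter for the tail numbers, consider nearest-rank percentiles over raw latency samples; the harness's exact percentile method is not specified, so this is purely illustrative:

# One slow outlier (e.g. a cold start) dominates p95/p99 on small samples.
import math

def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    return ordered[math.ceil(p / 100 * len(ordered)) - 1]  # nearest rank

latencies_ms = [12.0, 13.1, 13.9, 14.2, 15.0, 15.5, 16.8, 210.0]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p):.1f} ms")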

Product policies

Draft product-facing policy pages ship with the frontend at /privacy, /terms, and /acceptable-use. They are implementation-complete, but they should still receive legal review before a public launch.

Roadmap

  1. Replace miniature local reference indices with large-scale corpus builders.
  2. Add first-class citation ingestion and external paper backlink tracking.
  3. Publish signed export snapshots for long-term archival use.

Citing EvalLedger

@software{evalledger_2026,
  title  = {EvalLedger},
  year   = {2026},
  url    = {https://evalledger.dev},
  note   = {Registry for benchmark provenance and contamination reporting}
}
