EvalLedger is a registry for AI benchmark provenance. It exists to make benchmark versions citable, benchmark artifacts verifiable, and contamination checks inspectable instead of implicit.
The current benchmark ecosystem makes it hard to answer three basic questions:
- Which exact artifact was evaluated?
- Was the benchmark likely seen during pretraining?
- Can another researcher reproduce the result later?
EvalLedger stores benchmark metadata, version records, artifact hashes, and contamination reports in one public ledger so those questions have durable answers.
docker compose up --build
cd backend && uv run alembic upgrade head
cd backend && uv run python -m app.scripts.seed
Then open http://localhost:3000 for the web interface and http://localhost:8000/docs for the API.
- Contributor workflow: CONTRIBUTING.md
- Metadata standard: standard/METADATA_STANDARD.md
- Machine-readable schema: standard/metadata_schema.json
- Maintainer and operations runbooks: docs/README.md
Sign in via GitHub or Google at /login. No password required.
# 1. Sign in at https://evalledger-frontend.vercel.app/login (GitHub or Google)
# 2. Create an API key at https://evalledger-frontend.vercel.app/account
# 3. Configure the CLI:
cd cli
uv sync
uv run evalledger login --api-key el_your_api_key_here
The key is validated against the API and stored in ~/.evalledger/config.toml.
uv run evalledger submit --name "MMLU" --slug mmlu --version 2.0.0 --file ./mmlu.jsonl --domain reasoning --task-type multiple_choice --paper https://arxiv.org/abs/2009.03300 --license MIT
uv run evalledger verify mmlu 2.0.0
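Verification relies on the artifact hashes stored in the ledger. As an illustration of the idea, a manual spot-check of a downloaded artifact might look like the sketch below; the use of SHA-256 and the function name are assumptions for this example, not EvalLedger's confirmed internals.

```python
# Hypothetical manual digest check for a downloaded artifact.
# ASSUMPTION: the ledger records a SHA-256 hex digest; the real
# algorithm and record format belong to EvalLedger, not this sketch.
import hashlib

def artifact_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large artifacts fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Compare against the digest shown on the benchmark's version page:
# artifact_digest("./mmlu.jsonl") == recorded_digest
```

Streaming the file keeps the check usable for multi-gigabyte artifacts.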
uv run evalledger search reasoning
uv run evalledger info mmlu 2.0.0
curl http://localhost:8000/search?q=reasoning
curl http://localhost:8000/benchmarks/mmlu
curl http://localhost:8000/stats/overview
EvalLedger uses MinHash sketches to approximate textual overlap between submitted artifacts and reference corpora. Each example is tokenized, hashed into a fixed-size sketch, queried against a corpus-local LSH index, and then rechecked with exact token-level Jaccard similarity before being flagged. The result is not a proof of contamination; it is a reproducible, inspectable overlap signal.
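The sketch-then-recheck pipeline described above can be illustrated in a few lines. This is a minimal stdlib-only sketch, not EvalLedger's implementation: the shingle size, 64-hash sketch width, 0.5 threshold, and all function names are assumptions chosen for the example, and the LSH index step is elided (the sketch compares two documents directly).

```python
# Illustrative MinHash overlap check: cheap sketch comparison first,
# exact Jaccard recheck before flagging. Parameters are assumptions.
import hashlib

NUM_HASHES = 64    # sketch width (assumed)
SHINGLE_SIZE = 3   # tokens per shingle (assumed)

def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash(items: set[str]) -> list[int]:
    # One minimum per seeded hash approximates one random permutation.
    sketch = []
    for seed in range(NUM_HASHES):
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in items
        ))
    return sketch

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    # Fraction of matching sketch positions estimates Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def exact_jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

submitted = "the quick brown fox jumps over the lazy dog"
corpus_doc = "the quick brown fox jumps over a sleeping dog"
s1, s2 = shingles(submitted), shingles(corpus_doc)
approx = estimated_jaccard(minhash(s1), minhash(s2))
# Flag only when the cheap sketch match survives the exact recheck.
flagged = approx > 0.5 and exact_jaccard(s1, s2) > 0.5
```

In production the sketch comparison is what an LSH index accelerates across a whole corpus; the exact recheck then filters out sketch-level false positives.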
The repository ships the human-readable standard in standard/METADATA_STANDARD.md and the machine schema in standard/metadata_schema.json.
- Register the benchmark metadata through the CLI or web submit flow.
- Upload a versioned artifact with a semantic version.
- Review the generated contamination report before citing the record publicly.
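Before running the submit flow above, it can help to sanity-check a record locally. The sketch below is illustrative only: the field names are inferred from the CLI flags (`--name`, `--slug`, `--version`, and so on) and are assumptions, not the authoritative contract — standard/metadata_schema.json is the source of truth.

```python
# Illustrative pre-submission sanity check. Field names are ASSUMED
# from the CLI flags; defer to standard/metadata_schema.json.
import re

REQUIRED = {"name", "slug", "version", "domain", "task_type", "license"}
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")  # versions must be semantic

def check_metadata(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks sane."""
    problems = [f"missing field: {k}" for k in sorted(REQUIRED - record.keys())]
    if "version" in record and not SEMVER.match(record["version"]):
        problems.append(f"version is not semantic: {record['version']!r}")
    return problems

record = {
    "name": "MMLU",
    "slug": "mmlu",
    "version": "2.0.0",
    "domain": "reasoning",
    "task_type": "multiple_choice",
    "license": "MIT",
}
problems = check_metadata(record)
```

A check like this catches a malformed version string before the API rejects it.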
Use make dev for local development, make migrate after schema changes, make seed to load reference data, make test for automated checks, and make lint for static analysis.
Operational runbooks live in docs/operations/ and maintainer process notes live in docs/maintainers/. They cover incident response, backup and restore drills, release flow, migration discipline, and performance verification.
Use the built-in scenarios below to get a first-pass latency and throughput read on the API:
make loadtest API_URL=http://localhost:8000
make loadtest-account API_URL=http://localhost:8000 API_KEY=<user-api-key>
make loadtest-review API_URL=http://localhost:8000 API_KEY=<admin-api-key>
The harness reports success rate, throughput, p50/p95/p99 latency, and per-endpoint summaries for browse, account, and review flows. It is intended for controlled local or staging checks rather than internet-scale benchmarking. See docs/operations/performance.md for the full workflow and guidance on separating Render cold-start effects from real query regressions.
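For intuition on the p50/p95/p99 figures the harness prints, the sketch below computes them with the nearest-rank method over a list of per-request latencies. This is an illustration, not the harness's code; the method and the sample data are assumptions.

```python
# Illustrative percentile summary (nearest-rank method). A single
# cold-start outlier moves the tail percentiles far more than the median.
import math

def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # nearest-rank, 1-indexed
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies (ms); 200.0 plays the role of a cold start.
latencies_ms = [12.0, 15.0, 14.0, 200.0, 13.0, 16.0, 18.0, 14.5, 13.5, 17.0]
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
```

The asymmetry between the median and the tail is why the docs recommend separating cold-start effects from real query regressions before reading anything into p99.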
Draft product-facing policy pages ship with the frontend at /privacy, /terms, and /acceptable-use. They are implementation-complete, but they should still receive legal review before a public launch.
- Replace miniature local reference indices with large-scale corpus builders.
- Add first-class citation ingestion and external paper backlink tracking.
- Publish signed export snapshots for long-term archival use.
@software{evalledger_2026,
title = {EvalLedger},
year = {2026},
url = {https://evalledger.dev},
note = {Registry for benchmark provenance and contamination reporting}
}