EvalLedger

EvalLedger is a registry for AI benchmark provenance. It exists to make benchmark versions citable, benchmark artifacts verifiable, and contamination checks inspectable instead of implicit.

Why it exists

The current benchmark ecosystem makes it hard to answer three basic questions:

  1. Which exact artifact was evaluated?
  2. Was the benchmark likely seen during pretraining?
  3. Can another researcher reproduce the result later?

EvalLedger stores benchmark metadata, version records, artifact hashes, and contamination reports in one public ledger so those questions have durable answers.

Quick start

docker compose up --build
# in a second terminal, once the containers are up:
cd backend
uv run alembic upgrade head
uv run python -m app.scripts.seed

Then open http://localhost:3000 for the web interface and http://localhost:8000/docs for the interactive API documentation.
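As a quick smoke check that the API came up, hit one of the read endpoints shown in the API overview below. A minimal sketch, assuming the default ports from the steps above:

# Smoke-check the running stack; uses only the Python standard library.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/stats/overview") as resp:
    print(json.dumps(json.load(resp), indent=2))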

Repository guide

  • Contributor workflow: CONTRIBUTING.md
  • Metadata standard: standard/METADATA_STANDARD.md
  • Machine-readable schema: standard/metadata_schema.json
  • Maintainer and operations runbooks: docs/README.md

Authentication

Sign in via GitHub or Google at /login. No password required.

CLI setup

# 1. Sign in at https://evalledger-frontend.vercel.app/login (GitHub or Google)
# 2. Create an API key at https://evalledger-frontend.vercel.app/account
# 3. Configure the CLI:
cd cli
uv sync
uv run evalledger login --api-key el_your_api_key_here

The key is validated against the API and stored in ~/.evalledger/config.toml.
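If you want to confirm the key landed on disk, the config is plain TOML. Its field names are not documented here, so this sketch only lists them rather than assuming any:

# Inspect the stored CLI config (tomllib requires Python 3.11+).
import tomllib
from pathlib import Path

config_path = Path.home() / ".evalledger" / "config.toml"
with config_path.open("rb") as f:
    config = tomllib.load(f)
print(sorted(config))  # top-level field names vary by CLI version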

CLI usage

uv run evalledger submit \
  --name "MMLU" --slug mmlu --version 2.0.0 \
  --file ./mmlu.jsonl --domain reasoning --task-type multiple_choice \
  --paper https://arxiv.org/abs/2009.03300 --license MIT
uv run evalledger verify mmlu 2.0.0
uv run evalledger search reasoning
uv run evalledger info mmlu 2.0.0

API overview

curl "http://localhost:8000/search?q=reasoning"
curl http://localhost:8000/benchmarks/mmlu
curl http://localhost:8000/stats/overview
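The same endpoints are straightforward to script. A sketch in Python, assuming only that the responses are JSON; the exact shape is best confirmed at /docs:

# Query the search endpoint and pretty-print the JSON response.
import json
import urllib.parse
import urllib.request

base = "http://localhost:8000"
query = urllib.parse.urlencode({"q": "reasoning"})
with urllib.request.urlopen(f"{base}/search?{query}") as resp:
    results = json.load(resp)
print(json.dumps(results, indent=2))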

Contamination methodology

EvalLedger uses MinHash sketches to approximate textual overlap between submitted artifacts and reference corpora. Each example is tokenized, hashed into a fixed-size sketch, queried against a corpus-local LSH index, and then rechecked with exact token-level Jaccard similarity before being flagged. The result is not a proof of contamination; it is a reproducible, inspectable overlap signal.
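A self-contained sketch of the same two-stage idea follows. The tokenizer, sketch size, and both thresholds are assumptions, and the corpus-local LSH index is replaced by a direct sketch comparison to keep the example short:

# Two-stage overlap check: cheap MinHash screen, then exact Jaccard recheck.
import hashlib

NUM_PERM = 128  # number of simulated hash permutations in the sketch

def tokens(text: str) -> set[str]:
    return set(text.lower().split())  # stand-in tokenizer

def minhash(toks: set[str]) -> list[int]:
    # Salted blake2b stands in for NUM_PERM independent hash functions.
    return [
        min(int.from_bytes(hashlib.blake2b(f"{i}:{t}".encode(), digest_size=8).digest(), "big")
            for t in toks)
        for i in range(NUM_PERM)
    ]

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    # Fraction of matching sketch slots approximates Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / NUM_PERM

def exact_jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

submitted = tokens("the quick brown fox jumps over the lazy dog")
reference = tokens("the quick brown fox jumped over a lazy dog")
if estimated_jaccard(minhash(submitted), minhash(reference)) > 0.5:  # screen
    flagged = exact_jaccard(submitted, reference) > 0.8              # recheck
    print("flagged" if flagged else "below exact-match threshold")

The screen stage is where the real pipeline consults the LSH index; the recheck stage is what actually decides whether an example is flagged.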

Metadata standard

The repository ships the human-readable standard in standard/METADATA_STANDARD.md and the machine schema in standard/metadata_schema.json.
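A draft record can be checked against the machine schema before submission. The field names below mirror the CLI flags from the usage section and are assumptions about the schema; run from the repository root:

# Validate a draft metadata record against the shipped JSON schema.
import json
from jsonschema import validate  # pip install jsonschema

with open("standard/metadata_schema.json") as f:
    schema = json.load(f)

record = {
    "name": "MMLU",
    "slug": "mmlu",
    "version": "2.0.0",
    "domain": "reasoning",
    "task_type": "multiple_choice",
    "paper": "https://arxiv.org/abs/2009.03300",
    "license": "MIT",
}
validate(instance=record, schema=schema)  # raises ValidationError on mismatch
print("record conforms to the schema")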

Contributing a benchmark

  1. Register the benchmark metadata through the CLI or web submit flow.
  2. Upload a versioned artifact with a semantic version (see the hashing sketch after this list).
  3. Review the generated contamination report before citing the record publicly.
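For step 2, it can help to fingerprint the artifact locally so you can compare it against the hash the ledger records. Whether EvalLedger uses SHA-256 specifically is an assumption in this sketch:

# Hash an artifact file in streaming chunks to avoid loading it whole.
import hashlib

def artifact_digest(path: str) -> str:
    h = hashlib.sha256()  # assumed digest; check the record for the real one
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(artifact_digest("./mmlu.jsonl"))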

Contributing code

  • make dev for local development
  • make migrate after schema changes
  • make seed to load reference data
  • make test for automated checks
  • make lint for static analysis

Operations

Operational runbooks live in docs/operations/ and maintainer process notes live in docs/maintainers/. They cover incident response, backup and restore drills, release flow, migration discipline, and performance verification.

Load testing

Use the built-in scenarios below to get a first-pass latency and throughput read on the API:

make loadtest API_URL=http://localhost:8000
make loadtest-account API_URL=http://localhost:8000 API_KEY=<user-api-key>
make loadtest-review API_URL=http://localhost:8000 API_KEY=<admin-api-key>

The harness reports success rate, throughput, p50/p95/p99 latency, and per-endpoint summaries for browse, account, and review flows. It is intended for controlled local or staging checks rather than internet-scale benchmarking. See docs/operations/performance.md for the full workflow and guidance on separating Render cold-start effects from real query regressions.
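To see why cold starts matter for the tail numbers, consider nearest-rank percentiles over raw latency samples; the harness's exact percentile method is not specified, so this is purely illustrative:

# One slow outlier (e.g. a cold start) dominates p95/p99 on small samples.
import math

def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    return ordered[math.ceil(p / 100 * len(ordered)) - 1]  # nearest rank

latencies_ms = [12.0, 13.1, 13.9, 14.2, 15.0, 15.5, 16.8, 210.0]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p):.1f} ms")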

Product policies

Draft product-facing policy pages ship with the frontend at /privacy, /terms, and /acceptable-use. They are implementation-complete, but they should still receive legal review before a public launch.

Roadmap

  1. Replace miniature local reference indices with large-scale corpus builders.
  2. Add first-class citation ingestion and external paper backlink tracking.
  3. Publish signed export snapshots for long-term archival use.

Citing EvalLedger

@software{evalledger_2026,
  title  = {EvalLedger},
  year   = {2026},
  url    = {https://evalledger.dev},
  note   = {Registry for benchmark provenance and contamination reporting}
}
