
bellwether


The cost-and-failure-mode benchmark for LLM agents. Methodology plus Python package for honest, reproducible cross-provider agent evaluation.

Live leaderboard · Methodology

Why

Cross-provider LLM benchmarks today rank capability ("which model is smarter on average"). HELM and Chatbot Arena own that ground.

Practitioners building production systems need a different answer: which provider for THIS task, at THIS cost once retries and failures are accounted for, with THESE failure modes mapped against the product's tolerance.

bellwether answers the procurement question and ships the toolkit anyone can run on their own prompts.

What it measures

  • effective_TCoT: total cost per successfully completed task, including the cost of failed retries. The procurement-question metric, not the average-quality one (a worked sketch follows this list).
  • Failure-mode taxonomy: classify how models fail, not just whether (refusal, confabulation, schema break, truncation, partial, off-task, timeout, error). Maps to product-tolerance decisions.
  • Machine-checkable ground truth only. No LLM-as-judge, which sidesteps the well-documented judge-bias issue (an illustrative validator sketch appears after the methodology note below).
  • Prompt portability. Headline numbers use one canonical prompt across providers; portability cost (tuned vs canonical) is a v1 promise with a real contract.
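
A minimal sketch of how effective_TCoT is computed, assuming a simple per-attempt record with a billed cost and a pass/fail verdict. The field names and helper below are illustrative, not the package's actual schema:

# Illustrative only: RunRecord and effective_tcot are assumptions, not bellwether's API.
# effective_TCoT = total spend (successes and failed retries alike) divided by the
# number of successfully completed tasks.
from dataclasses import dataclass

@dataclass
class RunRecord:
    cost_usd: float    # what the provider billed for this attempt
    succeeded: bool    # did the attempt pass the machine-checkable validator?

def effective_tcot(runs: list[RunRecord]) -> float:
    total_spend = sum(r.cost_usd for r in runs)    # failed retries still cost money
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return float("inf")    # nothing completed: cost per success is unbounded
    return total_spend / successes

# Example: 3 attempts at $0.02 each, 2 succeeded -> $0.06 / 2 = $0.03 per completed task
runs = [RunRecord(0.02, True), RunRecord(0.02, False), RunRecord(0.02, True)]
print(f"effective_TCoT = ${effective_tcot(runs):.2f}")

The point of the metric: a cheap model that fails half the time can cost more per completed task than a pricier one that succeeds on the first attempt.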

See METHODOLOGY.md for formulas, retry policy, validator contract, and reproducibility caveats.
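
As a rough illustration of what "machine-checkable ground truth" means here, the sketch below validates a structured-output response deterministically and maps common failures onto the taxonomy above. It is a hypothetical example with assumed names; the real validator contract and taxonomy rules are defined in METHODOLOGY.md:

# Hypothetical validator sketch: a deterministic check against expected fields,
# with no LLM judging the answer. The mapping of failures to taxonomy labels here
# is one plausible reading, not the package's actual rules.
import json

FAILURE_MODES = ("refusal", "confabulation", "schema_break", "truncation",
                 "partial", "off_task", "timeout", "error")    # full taxonomy; the sketch only reaches three

def validate_extraction(raw_response: str, expected: dict) -> tuple[bool, str | None]:
    """Return (passed, failure_mode). failure_mode is None on success."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return False, "schema_break"    # not valid JSON at all
    missing = [k for k in expected if k not in parsed]
    if missing:
        return False, "partial"         # some required fields absent
    wrong = [k for k in expected if parsed[k] != expected[k]]
    if wrong:
        return False, "confabulation"   # fields present but values don't match ground truth
    return True, None

Because every verdict comes from code like this rather than from another model, runs are reproducible and judge bias never enters the score.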

Install

From source (current; PyPI publish pending):

git clone https://github.com/cartesianxr7/bellwether
cd bellwether
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env       # add ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY
pre-commit install         # optional, gates secret leaks
pytest                     # 120+ tests; all should pass

Once v0.1.0 is published to PyPI:

pip install bellwether

Run

bellwether list providers           # show registered provider adapters
bellwether list tasks               # show registered tasks

# Smoke test: 2 instances, 1 run each, $1 cap, takes ~10 seconds and ~$0.01:
bellwether run --instances 2 --n 1 --max-cost 1

# Standard bench: 5 instances, 3 runs per instance, all 3 providers, $5 cap:
bellwether run --instances 5 --n 3 --max-cost 5

# Re-render leaderboard from existing results without re-running:
bellwether report results

The cost guardrail (--max-cost USD) is a hard cap on total spend per invocation. Strongly recommended.

Status

v0.1: methodology, package, CLI, structured-output extraction task across Claude Sonnet 4.6, GPT-4o, and Gemini 2.5 Flash Lite. 1-task leaderboard, 3-pass reproducibility data.

v0.2 through v0.5: function calling (BFCL), RAG (FinanceBench/NQ-open/HotpotQA), multi-step reasoning (GAIA validation set), long-context summarization (GovReport). One task per release.

v1: code-generation task with sandboxing, OpenRouter open-weights, tuned-prompt-track formalization, plugin loader.

Repository

Contributing

See CONTRIBUTING.md. Adding a task or a provider adapter is a single PR; the contract is documented and small. Architecture overview in ARCHITECTURE.md; roadmap in ROADMAP.md; community standards in CODE_OF_CONDUCT.md.

Citation

If you use bellwether or its methodology in your work, please cite it. BibTeX:

@software{bellwether2026,
  author = {Hedrick, Stephen},
  title = {bellwether: cost-and-failure-mode benchmark for LLM agents},
  year = {2026},
  version = {0.1.0},
  url = {https://github.com/cartesianxr7/bellwether},
  license = {MIT}
}

CITATION.cff is the machine-readable form (GitHub renders a "Cite this repository" button from it).

License

MIT. See LICENSE.

Author

Stephen Hedrick.
