livebenchviz — LiveBench Dashboard

An interactive dashboard that visualises LiveBench scores for 119 LLMs across 8 axes, with OpenRouter and Ollama inference linkage.

LiveBench is a contamination-free, monthly-updated benchmark. It draws its questions from recent (post-training-cutoff) math competitions, news, and coding contests, so scores are genuinely hard to inflate through memorisation. This project uses LiveBench as its sole benchmark source — no scores are reweighted or modified; the pipeline only aggregates LiveBench's own per-task scores into categories and a global average.

Shown benchmarks

Every model in the dashboard carries 8 score axes (0–100, higher is better), all sourced directly from LiveBench's published release table. Each category score is the mean of the scores on its underlying LiveBench tasks.

Axis (export key)	LiveBench category	Underlying tasks	What it probes
`lb_avg`	Global average	equal-weighted mean of the 7 categories below	Overall general capability
`lb_coding`	Coding	code_generation, code_completion	LiveCodeBench-style function synthesis & completion
`lb_agentic`	Agentic Coding	javascript, typescript, python	Real-world, repo-style coding tasks
`lb_math`	Mathematics	AMPS_Hard, integrals_with_game, math_comp, olympiad	Competition & symbolic mathematics
`lb_reasoning`	Reasoning	theory_of_mind, zebra_puzzle, spatial, logic_with_navigation	Multi-step logical & spatial reasoning
`lb_data`	Data Analysis	consecutive_events, tablejoin, tablereformat	Tabular reasoning & transformation
`lb_lang`	Language	connections, plot_unscrambling, typos	Language manipulation & comprehension
`lb_instruct`	Instruction Following	paraphrase, simplify, story_generation, summarize	Following precise natural-language instructions

`lb_avg` is the equal-weighted mean of the seven category scores. All values are rounded to one decimal place. The pipeline does not modify, reweight, or re-normalise any LiveBench score.
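
As a rough illustration of the aggregation described above, the sketch below averages per-task scores into the seven categories and then averages those into `lb_avg`. It is not the pipeline's actual code: the function and type names (`aggregate`, `TaskScores`, `CATEGORY_TASKS`) are hypothetical, every task is assumed to have a score, and averaging the unrounded category means before rounding is an assumption, since the README does not say where rounding happens.

```typescript
// Hypothetical input shape: one 0-100 score per LiveBench task name.
type TaskScores = Record<string, number>;

// Category axis -> underlying task names, copied from the table above.
const CATEGORY_TASKS: Record<string, string[]> = {
  lb_coding: ["code_generation", "code_completion"],
  lb_agentic: ["javascript", "typescript", "python"],
  lb_math: ["AMPS_Hard", "integrals_with_game", "math_comp", "olympiad"],
  lb_reasoning: ["theory_of_mind", "zebra_puzzle", "spatial", "logic_with_navigation"],
  lb_data: ["consecutive_events", "tablejoin", "tablereformat"],
  lb_lang: ["connections", "plot_unscrambling", "typos"],
  lb_instruct: ["paraphrase", "simplify", "story_generation", "summarize"],
};

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;
const round1 = (x: number): number => Math.round(x * 10) / 10;

// Returns all 8 axes. Assumes every task listed above is present in `tasks`.
function aggregate(tasks: TaskScores): Record<string, number> {
  const out: Record<string, number> = {};
  const categoryMeans: number[] = [];
  for (const [axis, names] of Object.entries(CATEGORY_TASKS)) {
    const m = mean(names.map((n) => tasks[n]));
    categoryMeans.push(m); // keep the unrounded mean for the global average (assumption)
    out[axis] = round1(m);
  }
  // Equal weight per category, regardless of how many tasks each one contains.
  out["lb_avg"] = round1(mean(categoryMeans));
  return out;
}
```

Because the global average weights categories rather than tasks, a two-task category such as Coding counts as much toward `lb_avg` as a four-task category such as Mathematics.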

Inference & availability linkage

Alongside the scores, the dashboard exposes availability metadata:

`openRouterId`: set for the 33 (of 119) models matched to a live OpenRouter model. The "OpenRouter only" toggle filters to these models client-side.
`inference.json`: records, per model, whether it is available on Ollama Cloud, pullable for local Ollama, and/or served on OpenRouter.

No pricing data is fetched. OpenRouter and Ollama catalogues are used only for display names, release dates, ID matching, and inference flags.
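
The client-side "OpenRouter only" toggle could look roughly like the following. This is a sketch, not the dashboard's actual code: only the `openRouterId` field name comes from this README, while the `ModelRow` shape, the `visibleModels` helper, and the choice to treat a missing, null, or empty ID as "not matched" are assumptions.

```typescript
// Hypothetical row shape; the real exported records carry more fields.
interface ModelRow {
  name: string;
  lb_avg: number;
  openRouterId?: string | null; // present only for models matched to OpenRouter
}

// Returns the rows to display. With the toggle off, every model is shown;
// with it on, only models that have a non-empty openRouterId remain.
function visibleModels(models: ModelRow[], openRouterOnly: boolean): ModelRow[] {
  if (!openRouterOnly) return models;
  return models.filter(
    (m) => typeof m.openRouterId === "string" && m.openRouterId.length > 0,
  );
}
```

Filtering in the browser keeps the exported JSON as a single static file, so the toggle needs no extra request.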

Repository layout

Path	Purpose
`pipeline/`	Scripts that fetch, normalise, and export LiveBench data → `benchmark_lb.json` + `inference.json`
`website/`	SvelteKit dashboard that renders the exported JSON

Building the data

cd pipeline
pnpm install
pnpm run all # fetch → process → export → copy to website/static

See pipeline/README.md for data sources and status, and pipeline/PIPELINE.md for the full step-by-step walkthrough.

Running the dashboard

pnpm install
pnpm run dev
