A multimodal evaluation framework for scheduling LLM and VLM evaluations across HPC clusters. Built as an orchestration layer over lm-eval, lighteval, and lmms-eval, with a plugin system for contributing custom benchmarks.
- Schedule evaluations on multiple models and tasks:
oellm-eval schedule - Collect results and check for missing evaluations:
oellm-eval collect - Diagnose your environment (cluster vars, HF cache, venv engines):
oellm-eval doctor - Task groups for pre-defined evaluation suites with automatic dataset pre-downloading
- Multi-cluster support with auto-detection (Leonardo, LUMI, JURECA, Jupiter, Snellius)
- Image evaluation via lmms-eval (VQAv2, MMBench, MMMU, ChartQA, DocVQA, TextVQA, OCRBench, MathVista)
- Video evaluation via lmms-eval (VideoMMMU, EgoSchema, VideoMME, ActivityNet-QA, LongVideoBench)
- Audio evaluation via lmms-eval (LibriSpeech, FLEURS, GigaSpeech, TED-LIUM, WenetSpeech, CoVoST2, VocalSound, MuChoMusic)
- Plugin system for contributing custom benchmarks without touching core code
- Automatic building and deployment of containers
| Command | What it does |
|---|---|
oellm-eval schedule |
Expand models × tasks, pre-download models/datasets on the login node, generate and submit a SLURM array job (or run locally with --local) |
oellm-eval eval --config eval.yaml |
Same as schedule, driven by a YAML config file; CLI flags override the file |
oellm-eval collect <dir> |
Aggregate result JSONs into eval_results.csv + .json + .md; --check writes a re-schedulable CSV of missing jobs |
oellm-eval list-tasks |
Show every task group, its engine, task count, and n-shot settings |
oellm-eval compare <a> <b> |
Diff two collected results.json files task by task |
oellm-eval doctor |
Diagnose the environment: cluster detection, env vars, HF cache, venv engines |
Prerequisites:
- Install uv
- Set the
HF_HOMEenvironment variable to point to your HuggingFace cache directory (e.g.export HF_HOME="/path/to/your/hf_home", on LUMI use the path/scratch/project_462000963/cache/huggingface). This is where models and datasets will be cached. Compute nodes typically have no internet access, so all assets must be pre-downloaded into this directory.
# Install
uv tool install -p 3.12 git+https://github.com/elliot-project/elliot-cli.git
# Run evaluations using a task group
oellm-eval schedule \
--models "EleutherAI/pythia-160m" \
--task-groups "open-sci-0.01"
# Image evaluation (requires venv with lmms-eval)
oellm-eval schedule \
--models "llava-hf/llava-1.5-7b-hf" \
--task-groups "image-vqa" \
--venv-path ~/elliot-venvThis will automatically:
- Detect your current HPC cluster (Leonardo, LUMI, JURECA, Jupiter, or Snellius)
- Download and cache the specified models
- Pre-download datasets for known tasks (see warning below)
- Generate and submit a SLURM job array with appropriate cluster-specific resources and using containers built for this cluster
For custom environments instead of containers, pass --venv-path (see docs/VENV.md).
Task groups are pre-defined evaluation suites in task-groups.yaml. Each group specifies tasks, their n-shot settings, and HuggingFace dataset mappings.
| Group | Description | Engine |
|---|---|---|
open-sci-0.01 |
COPA, MMLU, HellaSwag, ARC, etc. | lm-eval |
belebele-eu-5-shot |
Belebele in 23 European languages | lm-eval |
flores-200-eu-to-eng |
EU to English translation | lighteval |
flores-200-eng-to-eu |
English to EU translation | lighteval |
global-mmlu-eu |
Global MMLU in EU languages | lm-eval |
mgsm-eu |
Multilingual GSM8K | lm-eval |
generic-multilingual |
XWinograd, XCOPA, XStoryCloze | lm-eval |
include |
INCLUDE benchmarks (44 languages) | lm-eval |
Super groups: oellm-multilingual (all multilingual benchmarks combined)
| Group | Benchmark | Engine |
|---|---|---|
image-vqa |
All 8 benchmarks combined | lmms-eval |
image-vqav2 |
VQAv2 | lmms-eval |
image-mmbench |
MMBench | lmms-eval |
image-mmmu |
MMMU | lmms-eval |
image-chartqa |
ChartQA | lmms-eval |
image-docvqa |
DocVQA | lmms-eval |
image-textvqa |
TextVQA | lmms-eval |
image-ocrbench |
OCRBench | lmms-eval |
image-mathvista |
MathVista (CoT / format / solution leaves — needs GPT judge) | lmms-eval |
| Group | Benchmark | Engine |
|---|---|---|
video-understanding |
All 5 benchmarks combined | lmms-eval |
video-videommmu |
VideoMMMU (perception / comprehension / adaptation leaves) | lmms-eval |
video-egoschema |
EgoSchema (long-form egocentric QA) | lmms-eval |
video-videomme |
Video-MME (11s-1h clips) | lmms-eval |
video-activitynet-qa |
ActivityNet-QA (requires GPT API) | lmms-eval |
video-longvideobench |
LongVideoBench (cross-segment reasoning) | lmms-eval |
The lmms-eval adapter class (llava_hf, llava_onevision, qwen2_5_vl, etc.) is auto-detected from the model name. Video (like image and audio) tasks run through lmms-eval, which is not included in the cluster containers or any pip extra — set up the general venv as described in docs/VENV.md and pass --venv-path.
| Group | Benchmark | Engine |
|---|---|---|
audio-understanding |
Curated suite: 8 leaf tasks, no judge-model dependency | lmms-eval |
audio-librispeech |
LibriSpeech ASR (WER on test-clean) | lmms-eval |
audio-common-voice-15 |
Common Voice 15 (en, fr, zh-CN) | lmms-eval |
audio-gigaspeech |
GigaSpeech large-scale ASR | lmms-eval |
audio-tedlium |
TED-LIUM v3 ASR | lmms-eval |
audio-wenet-speech |
WenetSpeech Chinese ASR (MER) | lmms-eval |
audio-covost2 |
CoVoST2 en→zh speech translation (BLEU) | lmms-eval |
audio-fleurs |
FLEURS multilingual speech | lmms-eval |
audio-voxpopuli, audio-ami, audio-people-speech |
Additional ASR corpora | lmms-eval |
audio-vocalsound |
Non-speech vocalisation classification | lmms-eval |
audio-muchomusic |
Music understanding MCQ | lmms-eval |
audio-air-bench-chat |
AIR-Bench chat (requires GPT judge) | lmms-eval |
audio-air-bench-foundation |
AIR-Bench foundation MCQ | lmms-eval |
audio-alpaca-audio, audio-openhermes, audio-wavcaps |
Instruction / captioning (GPT judge) | lmms-eval |
audio-clotho-aqa, audio-cn-college-listen-mcq, audio-dream-tts-mcq, audio-voicebench, audio-step2-paralinguistic |
QA / MCQ / paralinguistic probes | lmms-eval |
Audio tasks also run through lmms-eval — use the general venv from docs/VENV.md (the [audio] extra adds the audio decoding helpers, but lmms-eval itself must be installed per that guide). Judge-model groups (AIR-Bench chat, Alpaca-Audio, OpenHermes, WavCaps) need OPENAI_API_KEY on the compute node — scheduling refuses without it unless you pass --allow-missing-judge. The HPC Singularity image must include ffmpeg for non-WAV decode.
Community-contributed benchmarks that run outside the standard evaluation engines. See the contrib registry for the full list.
# Run all 8 image benchmarks
oellm-eval schedule \
--models "llava-hf/llava-1.5-7b-hf" \
--task-groups "image-vqa" \
--venv-path ~/elliot-venv
# Run all 5 video benchmarks
oellm-eval schedule \
--models "lmms-lab/llava-onevision-7b" \
--task-groups "video-understanding" \
--venv-path ~/elliot-venv
# Mix image and text benchmarks in one submission
oellm-eval schedule \
--models "llava-hf/llava-1.5-7b-hf" \
--task-groups "image-mmbench,open-sci-0.01" \
--venv-path ~/elliot-venv
# Use multiple task groups or a super group
oellm-eval schedule --models "model-name" --task-groups "belebele-eu-5-shot,global-mmlu-eu"
oellm-eval schedule --models "model-name" --task-groups "oellm-multilingual"The --local flag lets you run evaluations directly on your machine without a cluster or Singularity container. It generates the same eval script and executes it with bash, injecting fake SLURM environment variables so all tasks run sequentially in a single process. This is useful for testing that tasks and models are correctly configured before submitting to a cluster.
# 1. Add eval dependencies to the project venv
uv pip install lm-eval torch transformers accelerate "datasets<4.0.0"
# 2. Run evaluations locally — useful for smoke-testing with a small sample
oellm-eval schedule \
--models "EleutherAI/pythia-160m" \
--tasks "gsm8k" \
--n-shot 0 \
--venv-path .venv \
--local \
--limit 1Results are written to ./oellm-output/<timestamp>/results/.
Air-gapped cluster nodes (no internet): batch jobs set HF_HUB_OFFLINE=1 and get HF_HOME from your cluster env. With --local, the CLI defaults HF_HOME to ~/.cache/huggingface if unset and would otherwise allow Hub access—so on a compute node without network, export your real cache and offline flag before running, for example:
export HF_HOME=/path/to/your/shared/hf_cache # e.g. $WORK/hf_cache on Leonardo
export HF_HUB_OFFLINE=1
oellm-eval schedule ... --venv-path .venv --localThe HF_HUB_OFFLINE value is read when you invoke oellm-eval and baked into the generated script.
Override cluster defaults (partition, account, time limit, memory, etc.) with --slurm-template-var (JSON object). Provide SLURM_MEM to request an exact host memory amount, otherwise falls back to a default of 96G.
# Use a different partition (e.g. dev-g on LUMI when small-g is crowded)
oellm-eval schedule --models "model-name" --task-groups "open-sci-0.01" \
--slurm-template-var '{"PARTITION":"dev-g"}'
# Multiple overrides: partition, account, time limit, GPUs, exact RAM
oellm-eval schedule --models "model-name" --task-groups "open-sci-0.01" \
--slurm-template-var '{"PARTITION":"dev-g","ACCOUNT":"myproject","TIME":"02:00:00","GPUS_PER_NODE":2,"SLURM_MEM":"96G"}'Use exact env var names: PARTITION, ACCOUNT, GPUS_PER_NODE, SLURM_MEM. TIME (HH:MM:SS) overrides the time limit.
For lighteval runs, generated jobs default to batch_size=1 for local runs and
batch_size=32 for non-local (SLURM/cluster) runs. This reduces the risk of
out-of-memory failures where lighteval's auto batch-size detection can be
overly optimistic for multiple-choice loglikelihood tasks. You can still
override these defaults:
# Set an explicit batch size (overrides the local/cluster default)
BATCH_SIZE=8 oellm-eval schedule \
--models "model-name" \
--task-groups "belebele-eu-cf" \
--venv-path .venvIf you need full manual control over all model args, set MODEL_ARGS,
for example:
MODEL_ARGS='batch_size=8' oellm-eval schedule \
--models "model-name" --task-groups "belebele-eu-cf" --venv-path .venvDatasets are only automatically pre-downloaded for tasks defined in task-groups.yaml.
If you use custom tasks via --tasks that are not in the task groups registry, the CLI will attempt to look them up but cannot guarantee the datasets will be cached. This may cause failures on compute nodes that don't have network access.
Recommendation: Use --task-groups when possible, or ensure your custom task datasets are already cached in $HF_HOME before scheduling.
After evaluations complete, collect results into a CSV. collect recursively searches the given directory for every jobs.csv file and every .json result file, so you can point it at a top-level output folder that contains many sub-runs:
output/
├── hellaswag_mt1/
│ ├── jobs.csv
│ └── results/
├── hellaswag_mt2/
│ ├── jobs.csv
│ └── results/
└── global_mmlu1/
├── jobs.csv
└── results/
# Basic collection — writes eval_results.csv, eval_results.json, eval_results.md
oellm-eval collect /path/to/eval-output-dir
# Check for missing evaluations and create a CSV for re-running them
oellm-eval collect /path/to/eval-output-dir --check --output-csv results.csvThree output files are written next to your --output-csv path: the CSV (raw metric per row), a versioned JSON envelope, and a Markdown table with metrics normalized to a 0–100 scale.
All jobs.csv files found under results_dir are merged into one; if the same (model_path, task_path, n_shot) row appears in multiple files the later-sorted entry wins (override duplicates). The merged jobs list is then compared against all .json result files found recursively.
The --check flag outputs a results_missing.csv that can be used to re-schedule failed jobs:
oellm-eval schedule --eval-csv-path results_missing.csvFor full control, provide a CSV file with columns: model_path, task_path, n_shot, and optionally eval_suite (one of lm_eval — the default, lighteval, lmms_eval, evalchemy, or a contrib suite name):
oellm-eval schedule --eval-csv-path custom_evals.csvNote: field values must not contain commas, quotes, or newlines — the SLURM-side reader splits rows on commas, and scheduling rejects such rows with an error. Model args like
model,revision=...are not supported.
uv tool install -p 3.12 git+https://github.com/elliot-project/elliot-cli.gitUpdate to latest:
uv tool upgrade oellm-evalDue to limited space in $HOME on JSC clusters, set these environment variables:
export UV_CACHE_DIR="/p/project1/<project>/$USER/.cache/uv-cache"
export UV_INSTALL_DIR="/p/project1/<project>/$USER/.local"
export UV_PYTHON_INSTALL_DIR="/p/project1/<project>/$USER/.local/share/uv/python"
export UV_TOOL_DIR="/p/project1/<project>/$USER/.cache/uv-tool-cache"Leonardo, LUMI, JURECA, Jupiter, and Snellius — detected automatically from the login node's hostname (see oellm/resources/clusters.yaml). Any value there can be overridden by exporting the environment variable before scheduling.
schedule verifies before submission that the chosen runtime (venv or container) can actually run the scheduled suites — missing engines, missing suite env vars (e.g. AUDIOBENCH_DIR), and version-pinned groups on the wrong engine (dclm-core-22 needs lm-eval==0.4.9.2) are rejected with an actionable message instead of failing hours later on a compute node. Bypass with --skip-checks.
Run the same checks standalone at any time:
# Full report: cluster detection, env vars, HF cache, SLURM binaries, venv engines
oellm-eval doctor --venv-path /path/to/.venv
# Treat the engines these groups need as required (exit 1 if missing)
oellm-eval doctor --venv-path /path/to/.venv --task-groups "image-vqa,open-sci-0.01"oellm-eval schedule --helpgit clone https://github.com/elliot-project/elliot-cli.git
cd elliot-cli
uv sync --extra dev
# Run all unit tests
uv run pytest tests/ -v
# Download-only mode for testing
uv run oellm-eval schedule --models "EleutherAI/pythia-160m" --task-groups "open-sci-0.01" --download-only| Cluster | Guide |
|---|---|
| Leonardo (CINECA) | docs/LEONARDO.md |
| LUMI, JURECA | Coming soon |
| Snellius | Coming soon |
| Doc | Description |
|---|---|
| Using a Virtual Environment | Setting up a custom venv with lm-eval, lmms-eval, and lighteval |
| Container Workflow | How Apptainer containers are built, deployed, and used |
| Doc | Description |
|---|---|
| Adding Tasks & Task Groups | YAML structure for defining new evaluation suites |
| Contributing Custom Benchmarks | Step-by-step guide for adding a contrib plugin |
| Contrib Registry | List of community-contributed benchmarks |
ELLIOT supports two paths for adding benchmarks:
- Benchmark already in lm-eval / lighteval / lmms-eval -- add a YAML entry to
task-groups.yaml - Fully custom benchmark -- drop a contrib plugin into
oellm/contrib/
See the Contributing Guide for step-by-step instructions.
Containers are deployed manually since PR #46. To build and deploy, select "Run workflow" in Actions.