Open-source LLM benchmark suite for medical tasks.
medmarks.ai | arXiv:2605.01417
Medmarks is a comprehensive benchmark suite for evaluating medical capabilities in large language models. It includes 30 open-source benchmarks spanning question answering, information extraction, consumer health questions, clinical reasoning, EHR interactions, medical calculations, and open-ended medical tasks.
This repository contains the runnable benchmark environments, evaluation configs, result processing tools, and win-rate analysis pipeline used for Medmarks. It also contains the medarc_verifiers Python library, which provides the shared CLI, parsers, rewards, judging utilities, and orchestration helpers used by the benchmark environments.
Medmarks is organized into three practical subsets:
| Subset | Description |
|---|---|
| Medmarks-V | Verifiable tasks, including multiple-choice QA and other tasks with deterministic or programmatic grading |
| Medmarks-OE | Open-ended tasks evaluated with LLM-as-a-Judge |
| Medmarks-T | Experimental training-capable environments with train/test splits for post-training and RL experiments |
The benchmark suite is implemented as verifiers environments under environments/. The main runnable suite configs are:
| Config | Purpose |
|---|---|
| `configs/medmarks-verified.toml` | Medmarks-V suite |
| `configs/medmarks-open_ended.toml` | Medmarks-OE suite |
| `configs/medmarks-endpoints.toml` | Portable model aliases and sampling defaults for Medmarks runs |
| `configs/medmarks-smoke.toml` | Small Medmarks-V sanity-check run |
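As a rough illustration of how a suite config ties environments and run settings together, the sketch below pairs benchmark environments with per-run options. The key names are assumptions, not the documented schema; see `docs/medarc-eval-bench.md` for the actual format.

```toml
# Illustrative sketch only: key names are assumptions, not the documented
# medarc-eval bench schema (see docs/medarc-eval-bench.md).
[[benchmarks]]
environment = "medqa"    # a verifiers environment under environments/
num_examples = -1        # hypothetical knob: evaluate the full split

[[benchmarks]]
environment = "pubmedqa"
num_examples = 500       # hypothetical knob: cap the evaluated examples
```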
Set up the environment:

```bash
uv venv --python 3.12
source .venv/bin/activate
uv sync
```

Run a single benchmark:

```bash
uv run medarc-eval medqa -m openai/gpt-4.1-mini -n 25
```

Run a Medmarks suite config:

```bash
uv run medarc-eval bench --config configs/medmarks-verified.toml
```

Run a Medmarks suite with one of the published model aliases:
```bash
uv run medarc-eval bench \
    --config configs/medmarks-verified.toml \
    --endpoints-path configs/medmarks-endpoints.toml \
    -m gpt-oss-20b-low \
    --api-base-url https://api.pinference.ai/api/v1 \
    --api-key-var PRIME_API_KEY
```

`configs/medmarks-endpoints.toml` is an alias registry, not a deployment config. It maps names such as `gpt-oss-20b-low` or `medgemma-27b-text` to provider model IDs, client types, and model-specific sampling defaults. It intentionally omits `url`, `key`, and `max_concurrent`; supply those with `--provider` or with `--api-base-url` and `--api-key-var` for your deployment. The gpt-oss aliases use the Verifiers `openai_responses` client type.
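For orientation, the sketch below shows roughly what an alias entry looks like. The table and key names are illustrative assumptions rather than the file's actual schema; consult `configs/medmarks-endpoints.toml` itself for the real entries.

```toml
# Illustrative sketch only: the key names below are assumptions, not the
# actual medmarks-endpoints.toml schema.
[gpt-oss-20b-low]
model = "openai/gpt-oss-20b"      # provider model ID (assumed)
client_type = "openai_responses"  # Verifiers client type noted above
reasoning_effort = "low"          # model-specific sampling default (assumed)
# No url, key, or max_concurrent here: pass those at run time via
# --provider or --api-base-url / --api-key-var.
```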
Preview the resolved jobs before running:
```bash
uv run medarc-eval bench \
    --config configs/medmarks-verified.toml \
    --endpoints-path configs/medmarks-endpoints.toml \
    -m gpt-oss-20b-low \
    --api-base-url https://api.pinference.ai/api/v1 \
    --api-key-var PRIME_API_KEY \
    --dry-run
```

Run the same alias against a local vLLM server exposing an OpenAI-compatible API:
```bash
VLLM_API_KEY=local-key uv run medarc-eval bench \
    --config configs/medmarks-verified.toml \
    --endpoints-path configs/medmarks-endpoints.toml \
    -m gpt-oss-20b-low \
    --api-base-url http://127.0.0.1:8000/v1 \
    --api-key-var VLLM_API_KEY \
    --dry-run
```

Process outputs and compute win rates:
```bash
uv run medarc-eval process --runs-dir runs/evals
uv run medarc-eval winrate
```

Evaluation outputs are written under `runs/evals/`, processed parquet files under `runs/processed/`, and win-rate summaries under `runs/processed/winrate/`.
| Page | Description |
|---|---|
| `docs/developer-guide.md` | Developer setup, environment authoring, and local workflow |
| `docs/medarc-eval.md` | Full `medarc-eval` CLI documentation |
| `docs/medarc-eval-bench.md` | TOML benchmark suite execution |
| `docs/medarc-eval-process.md` | Processing eval outputs into parquet |
| `docs/medarc-eval-winrate.md` | HELM-style win-rate computation |
| `docs/medarc-orchestrate.md` | Running local vLLM benchmark jobs with Docker or Slurm/Pyxis |
`--` indicates no dedicated training split. "Not specified" means we found no explicit dataset license in the dataset source. Evaluated counts reflect the effective Medmarks evaluation split or configured subset; MedDialog is intentionally capped at the first 2,500 examples.
| Dataset | Description | License / terms | # Evaluated | # Training |
|---|---|---|---|---|
| Medmarks-V (Verifiable) | ||||
| CareQA | Healthcare exam questions with multiple-choice reasoning, English subset. | Apache-2.0 | 5,621 | -- |
| HEAD-QA v2 | Extended healthcare questions spanning 10 years of Spanish professional exams, English subset. | MIT | 12,751 | -- |
| LongHealth | Long-context synthetic patient cases with information extraction and sorting tasks, task1 and task2 splits. | Apache-2.0 | 1,200 | -- |
| M-ARC | Long-tail medical questions designed to test model resistance to inflexible clinical reasoning patterns. | Apache-2.0 | 100 | -- |
| Med-HALT | Clinical reasoning hallucination detection via false-confidence tests and "none of the above" recognition. | Apache-2.0 | 22,152 | -- |
| MedCalc-Bench | Clinical calculator questions evaluating medical computation and formula application skills. | CC-BY-SA-4.0 | 1,100 | 10,538 |
| MedConceptsQA | Multiple-choice questions on medical coding systems (e.g., ICD-9, ICD-10); only ICD-10-CM subsamples are evaluated. | Not specified | 6,000 | -- |
| Medbullets | USMLE Step 2 and Step 3 style clinical reasoning questions sourced from social media. | Not specified | 308 | -- |
| MedHallu | Medical hallucination detection benchmark with four domain-specific error categories derived from the PubMedQA dataset. | MIT | 2,000 | -- |
| MedMCQA | Multiple-choice questions from Indian medical entrance exams across 21 medical subjects. | Apache-2.0 | 4,183 | 182,822 |
| MedQA | Multiple-choice questions from USMLE medical licensing exams. | CC-BY-4.0 | 1,273 | 10,178 |
| MedXpertQA | High-difficulty multiple-choice questions with ~10 options across 17 specialties, evaluating expert-level medical knowledge, text subset. | MIT | 2,450 | -- |
| MetaMedQA | Questions testing a model's awareness and recognition of unanswerable medical queries using uncertainty options. | CC-BY-4.0 | 1,373 | -- |
| MMLU-Pro-Health | Health subset of the MMLU-Pro benchmark, featuring general health-related questions with up to 10 answer options per question. | MIT | 818 | -- |
| PubHealthBench | Multiple-choice questions derived from UK government public health guidance documents, reviewed subset. | CC-BY-4.0 | 760 | -- |
| PubMedQA | Yes/no/maybe question answering requiring reasoning over biomedical research abstracts, labeled subset. | MIT | 500 | 211,269 |
| SCTPublic | Script Concordance Tests evaluating clinical reasoning under diagnostic uncertainty. | MIT | 174 | -- |
| SuperGPQA-Med | Graduate-level questions spanning 6 medical fields, easy and hard difficulty subsets. | ODC-BY | 1,126 | -- |
| Medmarks-OE (Open-Ended) | ||||
| ACI-Bench | Clinical dialogue transcripts paired with corresponding structured clinical notes. | CC-BY-4.0 | 210 | 114 |
| AgentClinic | Multimodal multi-agent OSCE-style clinical dialogues for interactive diagnostic reasoning evaluation. | MIT | 214 | -- |
| CareQA Open | Healthcare exam questions with open-ended reasoning, English subset. | Apache-2.0 | 2,769 | -- |
| HealthBench | Multi-turn healthcare conversations evaluated using physician-written scoring rubrics. | MIT | 5,000 | -- |
| MedAgentBench v2 | Agentic electronic health record tasks requiring FHIR API interactions. | Not specified; V1 MIT | 600 | -- |
| MedCaseReasoning | Diagnostic QA with clinician-authored reasoning traces from clinical case reports. | MIT | 500 | 13,092 |
| MedDialog | Large-scale patient-doctor conversations for medical dialogue generation and understanding; Medmarks evaluates a small subsample. | Not specified | 2,500 | 205,973 |
| MedExQA | Questions with dual expert explanations across 5 underrepresented medical specialties. | CC-BY-NC-SA-4.0 | 940 | -- |
| MedicationQA | Consumer-style medication questions with expert-validated answers from MedlinePlus. | CC-BY-4.0 | 690 | -- |
| MEDEC | Medical dataset for clinical error detection, extraction, and correction in synthetic medical notes. | CC-BY-4.0 | 597 | 2,189 |
| MedR-Bench | Clinical reasoning benchmark with step-by-step diagnostic and treatment planning traces on rare disease cases. | CC-BY-SA-4.0 | 1,453 | -- |
| MTSamples | Transcribed medical operative notes and reports evaluating models on procedural summaries and clinically appropriate treatment plans. | Not specified | 559 | -- |
```bibtex
@misc{warner2026medmarkscomprehensiveopensourcellm,
title={Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks},
author={Benjamin Warner and Ratna Sagari Grandhi and Max Kieffer and Aymane Ouraq and Saurav Panigrahi and Geetu Ambwani and Kunal Bagga and Nikhil Khandekar and Arya Hariharan and Nishant Mishra and Manish Ram and Shamus Sim Zi Yang and Ahmed Essouaied and Adepoju Jeremiah Moyondafoluwa and Robert Scholz and Bofeng Huang and Molly Beavers and Srishti Gureja and Anish Mahishi and Sameed Khan and Maxime Griot and Hunar Batra and Jean-Benoit Delbrouck and Siddhant Bharadwaj and Ronald Clark and Ashish Vashist and Anas Zafar and Leema Krishna Murali and Harsh Deshpande and Ameen Patel and William Brown and Johannes Hagemann and Connor Lane and Paul Steven Scotti and Tanishq Mathew Abraham},
year={2026},
eprint={2605.01417},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.01417},
}
```

Medmarks code in this repository is released under the MIT License. Individual benchmark datasets may have their own licenses or terms of use; consult the corresponding dataset sources and environment documentation before redistribution or commercial use.