Add ORTGenAI backend option to benchmark CLI#2420

Open
GopalakrishnanN wants to merge 1 commit into main from dev/AddORTGenAIBackEndOption

Conversation


GopalakrishnanN commented Apr 17, 2026

Context

The benchmark command currently defaults to the ONNX Runtime lm-eval model path. Olive already has ORTGenAI lm-eval support in the evaluator layer, but the benchmark CLI had no way to select it.

This PR exposes that capability through the benchmark CLI while preserving existing defaults.

What This Changes

  • Adds a new benchmark CLI argument: --backend with choices:
    • auto (default)
    • ort
    • ortgenai
  • Wires explicit backend selection into generated workflow config by setting evaluator model_class when backend is not auto.
  • Keeps current behavior unchanged when --backend auto is used (or omitted).
  • Adds validation: explicit --backend is only accepted for ONNX input models.
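
A minimal sketch of the flag and the config wiring described above (the function names and config shape are illustrative assumptions, not Olive's actual code):

```python
import argparse

def add_backend_option(parser: argparse.ArgumentParser) -> None:
    # Hypothetical sketch of the new benchmark CLI argument.
    parser.add_argument(
        "--backend",
        type=str,
        default="auto",
        choices=["auto", "ort", "ortgenai"],
        help="Backend for lm-eval model evaluation. 'ort' and 'ortgenai' require ONNX input.",
    )

def apply_backend(config: dict, backend: str) -> dict:
    # When the backend is explicit, set the evaluator's model_class;
    # 'auto' leaves the config untouched so existing behavior is preserved.
    if backend != "auto":
        config["evaluators"]["evaluator"]["model_class"] = backend
    return config

parser = argparse.ArgumentParser()
add_backend_option(parser)
args = parser.parse_args(["--backend", "ortgenai"])
config = apply_backend({"evaluators": {"evaluator": {}}}, args.backend)
print(config["evaluators"]["evaluator"]["model_class"])  # ortgenai
```

Omitting the flag leaves `model_class` unset, which is what keeps the change non-breaking.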

Why This Approach

  • Non-breaking by default: existing benchmark flows continue to infer model class automatically.
  • Minimal change surface: only benchmark CLI config generation and tests are touched.
  • Leverages existing evaluator support rather than introducing new runtime logic.

User-Facing Behavior

Examples:

  • Existing behavior (unchanged):
    • olive benchmark -m <model> --tasks arc_easy
  • Explicit ORT:
    • olive benchmark -m <onnx_model> --tasks arc_easy --backend ort
  • Explicit ORTGenAI:
    • olive benchmark -m <onnx_model> --tasks arc_easy --backend ortgenai

If --backend is provided for non-ONNX inputs, benchmark now raises a clear error.
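
The fail-fast validation could look roughly like this (the model-type string and error text are assumptions based on the PR description):

```python
def validate_backend(input_model_type: str, backend: str) -> None:
    # Reject explicit backends for non-ONNX inputs up front, before any
    # potentially slow HuggingFace hub lookups.
    if backend != "auto" and input_model_type.lower() != "onnxmodel":
        raise ValueError(
            f"--backend {backend} requires an ONNX input model, got {input_model_type}"
        )

validate_backend("OnnxModel", "ortgenai")  # passes silently
try:
    validate_backend("HfModel", "ortgenai")
except ValueError as err:
    print(err)
```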

Tests Added/Updated

  • Verifies ONNX benchmark accepts --backend ortgenai and writes evaluator model_class=ortgenai.
  • Verifies non-ONNX model with explicit backend raises expected ValueError.
  • Existing benchmark tests continue to pass.

Validation

  • pip install -e .
  • python -m olive --help
  • python -m olive benchmark --help
  • python -m pytest test/cli/test_cli.py -k benchmark_command -q

Copilot AI review requested due to automatic review settings April 17, 2026 18:37

Copilot AI left a comment


Pull request overview

This PR adds an explicit backend selection option to the olive benchmark CLI so users can choose between ONNX Runtime and ORTGenAI evaluation backends when benchmarking ONNX inputs, while keeping the default automatic behavior unchanged.

Changes:

  • Added --backend {auto,ort,ortgenai} to the benchmark CLI (default: auto).
  • Implemented fast, offline validation to reject explicit backends for non-ONNX inputs before any HuggingFace hub checks.
  • Added CLI tests to confirm model_class wiring for ortgenai and to ensure invalid usage fails without hitting the HF hub.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • olive/cli/benchmark.py: Adds the --backend flag, performs offline ONNX validation, and sets evaluator model_class when backend is explicitly selected.
  • test/cli/test_cli.py: Adds coverage for --backend ortgenai config generation and for early error behavior on non-ONNX inputs without HF hub access.

Comment thread olive/cli/benchmark.py Outdated
GopalakrishnanN force-pushed the dev/AddORTGenAIBackEndOption branch from 6e4e1b3 to 60a5f37 on April 17, 2026 18:41
@GopalakrishnanN
Author

@microsoft-github-policy-service agree company="Microsoft"

Comment thread olive/cli/benchmark.py Outdated
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread olive/cli/benchmark.py Outdated
Comment thread olive/cli/benchmark.py
Comment thread olive/cli/benchmark.py Outdated
def _get_run_config(self, tempdir: str) -> dict:
    config = deepcopy(TEMPLATE)

    # Validate --backend before get_input_model_config, which may trigger a

The Copilot suggestions are overcomplicating things. I think it's better to remove the is_local_onnx_model changes completely, and just check that the input_model_config you get after line 103 is onnxmodel when args.backend is not auto.

GopalakrishnanN force-pushed the dev/AddORTGenAIBackEndOption branch 4 times, most recently from 7491f39 to eda6f0b on April 23, 2026 01:22
@GopalakrishnanN
Author

End-to-end verification on real ONNX model

Ran olive benchmark against microsoft/Phi-3-mini-4k-instruct-onnx (cpu-int4-rtn-block-32-acc-level-4, GenAI-packaged with genai_config.json) with --tasks arc_easy --device cpu --limit 5 --batch_size 1, exercising both backends end-to-end.

| Metric | --backend ortgenai | --backend ort |
| --- | --- | --- |
| Generated workflow model_class | "ortgenai" | "ort" |
| Loglikelihood requests | 20/20 @ 1.44 it/s | 20/20 @ 1.43 it/s |
| arc_easy acc / acc_norm | 0.6 / 0.6 | 0.6 / 0.6 |
| Underlying runtime | og.Config / og.Model / og.Generator (LMEvalORTGenAIEvaluator) | onnxruntime.InferenceSession (LMEvalORTEvaluator) |

Confirms the full chain: CLI --backend flag -> workflow config evaluators.evaluator.model_class -> LMEvaluator.evaluate() -> lm_eval.api.registry.get_model(model_class) -> @register_model("ort") / @register_model("ortgenai") class in olive/evaluator/lmeval_ort.py. Accuracy numbers agree across backends on the same weights, validating both paths produce sensible results.
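
The registry dispatch in that chain follows lm_eval's @register_model pattern; the following is a self-contained mimic of the mechanism (not lm_eval's actual implementation, and the evaluator class bodies here are placeholders):

```python
# Minimal mimic of lm_eval's model registry; the real registry lives in
# lm_eval.api.registry, and the decorated classes in olive/evaluator/lmeval_ort.py.
MODEL_REGISTRY = {}

def register_model(*names):
    def decorator(cls):
        for name in names:
            MODEL_REGISTRY[name] = cls
        return cls
    return decorator

def get_model(model_class: str):
    return MODEL_REGISTRY[model_class]

@register_model("ort")
class LMEvalORTEvaluator:
    runtime = "onnxruntime.InferenceSession"

@register_model("ortgenai")
class LMEvalORTGenAIEvaluator:
    runtime = "og.Config / og.Model / og.Generator"

# The workflow config's model_class string selects the evaluator class:
print(get_model("ortgenai").__name__)  # LMEvalORTGenAIEvaluator
```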

Side note (not part of this PR): Olive's evaluation cache keys on model_id and does not include model_class; back-to-back runs with different --backend values on the same model currently reuse the first result unless .olive-cache/<workflow>/evaluations/ is cleared. Flagging for a possible follow-up.
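
A toy cache-key function makes the collision concrete; Olive's real cache layout is assumed here, not quoted:

```python
import hashlib
import json
from typing import Optional

def cache_key(model_id: str, model_class: Optional[str] = None) -> str:
    # Including model_class (and other evaluation settings) in the key would
    # keep --backend ort and --backend ortgenai results from colliding.
    payload = {"model_id": model_id}
    if model_class is not None:
        payload["model_class"] = model_class
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# Keying on model_id alone collides across backends; adding model_class does not.
print(cache_key("phi-3-mini", "ort") != cache_key("phi-3-mini", "ortgenai"))  # True
```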

GopalakrishnanN force-pushed the dev/AddORTGenAIBackEndOption branch from eda6f0b to 236701a on April 23, 2026 19:54

vraspar left a comment


Nice, clean PR. Minimal surface, follows the existing to_replace pattern, and correctly implements jambayk's feedback from #2396. A few actionable items below, none blocking.

1. Missing backward-compat assertion in existing tests
The existing test_benchmark_command_hfmodel and test_benchmark_command_onnxmodel don't assert that model_class is absent from the config when --backend is omitted entirely. Adding assert "model_class" not in config["evaluators"]["evaluator"] to those two tests would lock in the backward-compat guarantee.
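
Folded into the tests, that assertion might look like the following sketch (the config fixture here is a stand-in; the real tests would inspect the actual generated workflow config):

```python
def generated_config_without_backend() -> dict:
    # Stand-in for the config the benchmark CLI generates when --backend is
    # omitted; illustrative only, not Olive's real template.
    return {"evaluators": {"evaluator": {"type": "LMEvaluator"}}}

def test_benchmark_command_default_has_no_model_class():
    config = generated_config_without_backend()
    # Locks in the backward-compat guarantee: no model_class unless explicit.
    assert "model_class" not in config["evaluators"]["evaluator"]

test_benchmark_command_default_has_no_model_class()
```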

2. Cache key follow-up
As you noted in the comments, evaluation cache doesn't include model_class (or tasks, limit, batch_size, etc.), so switching --backend on the same model silently reuses stale results. This PR makes that easier to hit. Not in scope here, but worth a tracking issue.

Comment thread olive/cli/benchmark.py
type=str,
default="auto",
choices=["auto", "ort", "ortgenai"],
help="Backend for ONNX model evaluation. Use 'auto' to infer backend from model type.",


nit: The help string says "Backend for ONNX model evaluation" but auto is also valid (and the default) for HF/PT models. It just falls through to evaluator auto-detection. Consider something like:

"Backend for lm-eval model evaluation. 'ort' and 'ortgenai' require ONNX input. 'auto' infers backend from model type."

Comment thread olive/cli/benchmark.py
"onnxmodel",
}, "Only HfModel, PyTorchModel and OnnxModel are supported in benchmark command."

if self.args.backend != "auto" and input_model_config["type"].lower() != "onnxmodel":


Optional: ortgenai requires GenAI-packaged model assets (genai_config.json, etc.), not just any .onnx file. Right now a user can pass --backend ortgenai on a plain ONNX model and get a confusing runtime error deep inside lm_eval.

If you want to keep this simple (and I think you should), maybe just extend the error message or help text to hint at the requirement. No need for asset validation here.

@vraspar

vraspar commented Apr 24, 2026

Note: this review was generated with help from GitHub Copilot CLI.
