
feat: add AI gateway benchmark mode #85

Open
HeyGarrison wants to merge 6 commits into master from feat/add-ai-gateway-benchmark-mode

Conversation


@HeyGarrison HeyGarrison commented Apr 16, 2026

Summary

  • add a new ai-gateway benchmark mode with provider configs, scenario-based runs (short-nonstream, short-stream), gateway-specific scoring, and JSON result output under results/ai_gateway/<scenario>/
  • wire ai-gateway into the main runner and merge pipeline so matrix artifacts can be combined via src/merge-results.ts --mode ai-gateway
  • add CLI scripts and a dedicated GitHub Actions workflow to run provider/scenario matrix benchmarks and post ranked PR comments
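As a sketch of the result layout described above (the field names and the per-provider filename are assumptions; the real shape is defined under src/ai-gateway/):

```typescript
// Sketch of one per-run result record; field names are assumptions,
// the actual shape lives under src/ai-gateway/.
interface GatewayRunResult {
  provider: string;   // e.g. "openrouter"
  scenario: string;   // "short-nonstream" | "short-stream"
  model: string;      // shared across providers via AI_GATEWAY_MODEL
  iterations: number; // completed iteration count
}

// Results are written under results/ai_gateway/<scenario>/;
// the per-provider filename here is illustrative.
function resultPath(scenario: string, provider: string): string {
  return `results/ai_gateway/${scenario}/${provider}.json`;
}
```

src/merge-results.ts --mode ai-gateway would then presumably combine the per-provider files under each scenario directory into one ranked artifact.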

What This Tests

  • gateway transport performance and reliability using OpenAI-compatible POST /chat/completions
  • fixed prompt scenarios across providers with a shared model via AI_GATEWAY_MODEL
  • per-iteration timing and reliability metrics: first token latency, total latency, output token throughput, and success/error status
  • ranking via a latency-weighted composite score penalized by failures
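The scoring bullet above can be sketched as follows; the weights and normalization are illustrative assumptions, not the actual formula in src/ai-gateway/:

```typescript
// Illustrative composite-scoring sketch: latency-weighted, penalized by
// failures. The real weights and scale live in src/ai-gateway/.
interface IterationResult {
  firstTokenMs: number; // first token latency
  totalMs: number;      // total request latency
  ok: boolean;          // success/error status
}

function compositeScore(iters: IterationResult[]): number {
  const succeeded = iters.filter((i) => i.ok);
  if (succeeded.length === 0) return 0;
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  // Weight first-token latency more heavily than total latency (assumed 60/40).
  const weightedMs =
    0.6 * avg(succeeded.map((i) => i.firstTokenMs)) +
    0.4 * avg(succeeded.map((i) => i.totalMs));
  // Map latency onto a 0-100 scale: lower latency, higher score.
  const latencyScore = 100 / (1 + weightedMs / 1000);
  // Penalize failures in proportion to the error rate.
  return latencyScore * (succeeded.length / iters.length);
}
```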

Out of Scope (Intentional)

  • model answer quality/correctness evaluation
  • tool/function-calling behavior
  • long-context capability testing
  • cost/pricing benchmarking

Future Scope

  • gateway-specific platform features while keeping the same non-quality scope, e.g. retries/fallbacks, caching effects (cold vs warm), routing policy behavior, rate-limit handling, and concurrency/queueing behavior

Included In This PR

  • runtime and scoring implementation under src/ai-gateway/
  • runner integration in src/run.ts
  • merge integration in src/merge-results.ts
  • new scripts in package.json
  • new workflow: .github/workflows/ai-gateway-benchmarks.yml

Validation

  • ran npm run bench -- --mode ai-gateway --provider openrouter --iterations 1 (validated mode wiring and skip behavior when creds are missing)
  • ran npx tsx src/merge-results.ts --input /tmp/ai-gateway-merge-check --mode ai-gateway (validated merge-mode entrypoint)


github-actions bot commented Apr 16, 2026

AI Gateway Benchmark Results

SHORT NONSTREAM

Model: openai/gpt-5.4

| # | Provider | Score | First Token | Total | Tok/sec | Status |
|---|---|---|---|---|---|---|
| 1 | openrouter | 99.6 | 0.23s | 0.23s | 21.4 | 50/50 |
| 2 | vercel-ai-gateway | 98.7 | 0.73s | 0.73s | 6.8 | 50/50 |
| 3 | cloudflare-ai-gateway | 98.3 | 0.77s | 0.77s | 5.2 | 50/50 |

SHORT STREAM

Model: openai/gpt-5.4

| # | Provider | Score | First Token | Total | Tok/sec | Status |
|---|---|---|---|---|---|---|
| 1 | cloudflare-ai-gateway | 98.7 | 0.57s | 0.85s | 56.4 | 50/50 |
| 2 | openrouter | 98.7 | 0.49s | 0.85s | 50.7 | 50/50 |
| 3 | vercel-ai-gateway | 98.3 | 0.60s | 1.00s | 42.1 | 50/50 |

View full run


github-actions bot commented Apr 16, 2026

Sandbox Benchmark Results

Sequential

| # | Provider | Score | Median TTI | P95 | P99 | Status |
|---|---|---|---|---|---|---|
| 1 | daytona | 96.5 | 0.21s | 0.56s | 0.56s | 10/10 |
| 2 | vercel | 96.2 | 0.35s | 0.43s | 0.43s | 10/10 |
| 3 | archil | 96.0 | 0.25s | 0.61s | 0.61s | 10/10 |
| 4 | blaxel | 95.1 | 0.46s | 0.54s | 0.54s | 10/10 |
| 5 | e2b | 94.5 | 0.44s | 0.71s | 0.71s | 10/10 |
| 6 | runloop | 87.4 | 1.19s | 1.38s | 1.38s | 10/10 |
| 7 | hopx | 86.5 | 1.25s | 1.51s | 1.51s | 10/10 |
| 8 | modal | 83.8 | 1.42s | 1.90s | 1.90s | 10/10 |
| 9 | cloudflare | 78.3 | 2.03s | 2.39s | 2.39s | 10/10 |
| 10 | namespace | 73.1 | 1.86s | 3.94s | 3.94s | 10/10 |
| 11 | codesandbox | 37.1 | 3.81s | 19.78s | 19.78s | 10/10 |

Staggered

| # | Provider | Score | Median TTI | P95 | P99 | Status |
|---|---|---|---|---|---|---|
| 1 | archil | 98.4 | 0.15s | 0.18s | 0.18s | 10/10 |
| 2 | blaxel | 95.5 | 0.43s | 0.49s | 0.49s | 10/10 |
| 3 | e2b | 95.3 | 0.41s | 0.55s | 0.55s | 10/10 |
| 4 | daytona | 94.7 | 0.39s | 0.75s | 0.75s | 10/10 |
| 5 | vercel | 94.5 | 0.35s | 0.85s | 0.85s | 10/10 |
| 6 | hopx | 88.4 | 1.01s | 1.38s | 1.38s | 10/10 |
| 7 | modal | 82.2 | 1.67s | 1.95s | 1.95s | 10/10 |
| 8 | namespace | 81.1 | 1.86s | 1.95s | 1.95s | 10/10 |
| 9 | runloop | 81.0 | 1.68s | 2.23s | 2.23s | 10/10 |
| 10 | cloudflare | 78.2 | 1.98s | 2.49s | 2.49s | 10/10 |
| 11 | codesandbox | 36.9 | 3.85s | 20.67s | 20.67s | 10/10 |

Burst

| # | Provider | Score | Median TTI | P95 | P99 | Status |
|---|---|---|---|---|---|---|
| 1 | archil | 97.5 | 0.18s | 0.36s | 0.36s | 10/10 |
| 2 | daytona | 96.5 | 0.23s | 0.53s | 0.53s | 10/10 |
| 3 | vercel | 95.9 | 0.39s | 0.44s | 0.44s | 10/10 |
| 4 | e2b | 95.1 | 0.35s | 0.71s | 0.71s | 10/10 |
| 5 | blaxel | 95.0 | 0.48s | 0.54s | 0.54s | 10/10 |
| 6 | modal | 83.8 | 1.50s | 1.80s | 1.80s | 10/10 |
| 7 | hopx | 81.6 | 1.69s | 2.08s | 2.08s | 10/10 |
| 8 | namespace | 80.2 | 1.88s | 2.12s | 2.12s | 10/10 |
| 9 | cloudflare | 80.0 | 1.88s | 2.18s | 2.18s | 10/10 |
| 10 | runloop | 69.6 | 2.83s | 3.34s | 3.34s | 10/10 |
| 11 | codesandbox | 33.7 | 4.39s | 20.34s | 20.34s | 10/10 |

View full run · SVGs available as build artifacts


github-actions bot commented Apr 16, 2026

Storage Benchmark Results

1MB Files

| # | Provider | Score | Download | Throughput | Upload | Status |
|---|---|---|---|---|---|---|
| 1 | AWS S3 | 95.8 | 0.04s | 205.8 Mbps | 0.05s | 1000/1000 |
| 2 | Tigris | 95.5 | 0.05s | 172.2 Mbps | 0.14s | 1000/1000 |
| 3 | Cloudflare R2 | 94.3 | 0.12s | 71.2 Mbps | 0.21s | 995/1000 |

4MB Files

| # | Provider | Score | Download | Throughput | Upload | Status |
|---|---|---|---|---|---|---|
| 1 | AWS S3 | 97.3 | 0.06s | 576.7 Mbps | 0.22s | 1000/1000 |
| 2 | Tigris | 97.0 | 0.07s | 500.2 Mbps | 0.19s | 1000/1000 |
| 3 | Cloudflare R2 | 93.9 | 0.24s | 137.7 Mbps | 0.49s | 996/1000 |

10MB Files

| # | Provider | Score | Download | Throughput | Upload | Status |
|---|---|---|---|---|---|---|
| 1 | AWS S3 | 97.4 | 0.12s | 688.4 Mbps | 0.50s | 1000/1000 |
| 2 | Tigris | 93.3 | 0.49s | 171.7 Mbps | 0.90s | 1000/1000 |
| 3 | Cloudflare R2 | 93.0 | 0.41s | 204.4 Mbps | 1.44s | 1000/1000 |

16MB Files

| # | Provider | Score | Download | Throughput | Upload | Status |
|---|---|---|---|---|---|---|
| 1 | AWS S3 | 97.3 | 0.19s | 718.0 Mbps | 0.54s | 1000/1000 |
| 2 | Cloudflare R2 | 92.5 | 0.61s | 219.8 Mbps | 1.52s | 1000/1000 |
| 3 | Tigris | 92.2 | 0.70s | 192.1 Mbps | 1.37s | 999/1000 |

View full run · SVGs available as build artifacts
