
feat: add AI gateway benchmark mode #85

Open
HeyGarrison wants to merge 6 commits into master from feat/add-ai-gateway-benchmark-mode

Conversation


@HeyGarrison HeyGarrison commented Apr 16, 2026

Summary

  • add a new ai-gateway benchmark mode with provider configs, scenario-based runs (short-nonstream, short-stream), gateway-specific scoring, and JSON result output under results/ai_gateway/<scenario>/
  • wire ai-gateway into the main runner and merge pipeline so matrix artifacts can be combined via src/merge-results.ts --mode ai-gateway
  • add CLI scripts and a dedicated GitHub Actions workflow to run provider/scenario matrix benchmarks and post ranked PR comments
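As a sketch of the result layout described above (the field names and the per-provider filename are assumptions; the real shape is defined under src/ai-gateway/):

```typescript
// Sketch of one per-run result record; field names are assumptions,
// the actual shape lives under src/ai-gateway/.
interface GatewayRunResult {
  provider: string;   // e.g. "openrouter"
  scenario: string;   // "short-nonstream" | "short-stream"
  model: string;      // shared across providers via AI_GATEWAY_MODEL
  iterations: number; // completed iteration count
}

// Results are written under results/ai_gateway/<scenario>/;
// the per-provider filename here is illustrative.
function resultPath(scenario: string, provider: string): string {
  return `results/ai_gateway/${scenario}/${provider}.json`;
}
```

src/merge-results.ts --mode ai-gateway would then presumably combine the per-provider files under each scenario directory into one ranked artifact.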

What This Tests

  • gateway transport performance and reliability using OpenAI-compatible POST /chat/completions
  • fixed prompt scenarios across providers with a shared model via AI_GATEWAY_MODEL
  • per-iteration timing and reliability metrics: first token latency, total latency, output token throughput, and success/error status
  • ranking via a latency-weighted composite score penalized by failures
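The scoring bullet above can be sketched as follows; the weights and normalization are illustrative assumptions, not the actual formula in src/ai-gateway/:

```typescript
// Illustrative composite-scoring sketch: latency-weighted, penalized by
// failures. The real weights and scale live in src/ai-gateway/.
interface IterationResult {
  firstTokenMs: number; // first token latency
  totalMs: number;      // total request latency
  ok: boolean;          // success/error status
}

function compositeScore(iters: IterationResult[]): number {
  const succeeded = iters.filter((i) => i.ok);
  if (succeeded.length === 0) return 0;
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  // Weight first-token latency more heavily than total latency (assumed 60/40).
  const weightedMs =
    0.6 * avg(succeeded.map((i) => i.firstTokenMs)) +
    0.4 * avg(succeeded.map((i) => i.totalMs));
  // Map latency onto a 0-100 scale: lower latency, higher score.
  const latencyScore = 100 / (1 + weightedMs / 1000);
  // Penalize failures in proportion to the error rate.
  return latencyScore * (succeeded.length / iters.length);
}
```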

Out of Scope (Intentional)

  • model answer quality/correctness evaluation
  • tool/function-calling behavior
  • long-context capability testing
  • cost/pricing benchmarking

Future Scope

  • gateway-specific platform features while keeping the same non-quality scope, e.g. retries/fallbacks, caching effects (cold vs warm), routing policy behavior, rate-limit handling, and concurrency/queueing behavior

Included In This PR

  • runtime and scoring implementation under src/ai-gateway/
  • runner integration in src/run.ts
  • merge integration in src/merge-results.ts
  • new scripts in package.json
  • new workflow: .github/workflows/ai-gateway-benchmarks.yml

Validation

  • ran npm run bench -- --mode ai-gateway --provider openrouter --iterations 1 (validated mode wiring and skip behavior when creds are missing)
  • ran npx tsx src/merge-results.ts --input /tmp/ai-gateway-merge-check --mode ai-gateway (validated merge-mode entrypoint)


github-actions bot commented Apr 16, 2026

AI Gateway Benchmark Results

SHORT NONSTREAM

Model: openai/gpt-5.4

| # | Provider | Score | First Token | Total | Tok/sec | Status |
|---|---|---|---|---|---|---|
| 1 | openrouter | 99.6 | 0.23s | 0.23s | 21.4 | 50/50 |
| 2 | vercel-ai-gateway | 98.7 | 0.73s | 0.73s | 6.8 | 50/50 |
| 3 | cloudflare-ai-gateway | 98.3 | 0.77s | 0.77s | 5.2 | 50/50 |

SHORT STREAM

Model: openai/gpt-5.4

| # | Provider | Score | First Token | Total | Tok/sec | Status |
|---|---|---|---|---|---|---|
| 1 | cloudflare-ai-gateway | 98.7 | 0.57s | 0.85s | 56.4 | 50/50 |
| 2 | openrouter | 98.7 | 0.49s | 0.85s | 50.7 | 50/50 |
| 3 | vercel-ai-gateway | 98.3 | 0.60s | 1.00s | 42.1 | 50/50 |

View full run


github-actions bot commented Apr 16, 2026

Sandbox Benchmark Results

Sequential

| # | Provider | Score | Median TTI | P95 | P99 | Status |
|---|---|---|---|---|---|---|
| 1 | daytona | 96.5 | 0.21s | 0.56s | 0.56s | 10/10 |
| 2 | vercel | 96.2 | 0.35s | 0.43s | 0.43s | 10/10 |
| 3 | archil | 96.0 | 0.25s | 0.61s | 0.61s | 10/10 |
| 4 | blaxel | 95.1 | 0.46s | 0.54s | 0.54s | 10/10 |
| 5 | e2b | 94.5 | 0.44s | 0.71s | 0.71s | 10/10 |
| 6 | runloop | 87.4 | 1.19s | 1.38s | 1.38s | 10/10 |
| 7 | hopx | 86.5 | 1.25s | 1.51s | 1.51s | 10/10 |
| 8 | modal | 83.8 | 1.42s | 1.90s | 1.90s | 10/10 |
| 9 | cloudflare | 78.3 | 2.03s | 2.39s | 2.39s | 10/10 |
| 10 | namespace | 73.1 | 1.86s | 3.94s | 3.94s | 10/10 |
| 11 | codesandbox | 37.1 | 3.81s | 19.78s | 19.78s | 10/10 |

Staggered

| # | Provider | Score | Median TTI | P95 | P99 | Status |
|---|---|---|---|---|---|---|
| 1 | archil | 98.4 | 0.15s | 0.18s | 0.18s | 10/10 |
| 2 | blaxel | 95.5 | 0.43s | 0.49s | 0.49s | 10/10 |
| 3 | e2b | 95.3 | 0.41s | 0.55s | 0.55s | 10/10 |
| 4 | daytona | 94.7 | 0.39s | 0.75s | 0.75s | 10/10 |
| 5 | vercel | 94.5 | 0.35s | 0.85s | 0.85s | 10/10 |
| 6 | hopx | 88.4 | 1.01s | 1.38s | 1.38s | 10/10 |
| 7 | modal | 82.2 | 1.67s | 1.95s | 1.95s | 10/10 |
| 8 | namespace | 81.1 | 1.86s | 1.95s | 1.95s | 10/10 |
| 9 | runloop | 81.0 | 1.68s | 2.23s | 2.23s | 10/10 |
| 10 | cloudflare | 78.2 | 1.98s | 2.49s | 2.49s | 10/10 |
| 11 | codesandbox | 36.9 | 3.85s | 20.67s | 20.67s | 10/10 |

Burst

| # | Provider | Score | Median TTI | P95 | P99 | Status |
|---|---|---|---|---|---|---|
| 1 | archil | 97.5 | 0.18s | 0.36s | 0.36s | 10/10 |
| 2 | daytona | 96.5 | 0.23s | 0.53s | 0.53s | 10/10 |
| 3 | vercel | 95.9 | 0.39s | 0.44s | 0.44s | 10/10 |
| 4 | e2b | 95.1 | 0.35s | 0.71s | 0.71s | 10/10 |
| 5 | blaxel | 95.0 | 0.48s | 0.54s | 0.54s | 10/10 |
| 6 | modal | 83.8 | 1.50s | 1.80s | 1.80s | 10/10 |
| 7 | hopx | 81.6 | 1.69s | 2.08s | 2.08s | 10/10 |
| 8 | namespace | 80.2 | 1.88s | 2.12s | 2.12s | 10/10 |
| 9 | cloudflare | 80.0 | 1.88s | 2.18s | 2.18s | 10/10 |
| 10 | runloop | 69.6 | 2.83s | 3.34s | 3.34s | 10/10 |
| 11 | codesandbox | 33.7 | 4.39s | 20.34s | 20.34s | 10/10 |

View full run · SVGs available as build artifacts


github-actions bot commented Apr 16, 2026

Storage Benchmark Results

1MB Files

| # | Provider | Score | Download | Throughput | Upload | Status |
|---|---|---|---|---|---|---|
| 1 | AWS S3 | 95.8 | 0.04s | 205.8 Mbps | 0.05s | 1000/1000 |
| 2 | Tigris | 95.5 | 0.05s | 172.2 Mbps | 0.14s | 1000/1000 |
| 3 | Cloudflare R2 | 94.3 | 0.12s | 71.2 Mbps | 0.21s | 995/1000 |

4MB Files

| # | Provider | Score | Download | Throughput | Upload | Status |
|---|---|---|---|---|---|---|
| 1 | AWS S3 | 97.3 | 0.06s | 576.7 Mbps | 0.22s | 1000/1000 |
| 2 | Tigris | 97.0 | 0.07s | 500.2 Mbps | 0.19s | 1000/1000 |
| 3 | Cloudflare R2 | 93.9 | 0.24s | 137.7 Mbps | 0.49s | 996/1000 |

10MB Files

| # | Provider | Score | Download | Throughput | Upload | Status |
|---|---|---|---|---|---|---|
| 1 | AWS S3 | 97.4 | 0.12s | 688.4 Mbps | 0.50s | 1000/1000 |
| 2 | Tigris | 93.3 | 0.49s | 171.7 Mbps | 0.90s | 1000/1000 |
| 3 | Cloudflare R2 | 93.0 | 0.41s | 204.4 Mbps | 1.44s | 1000/1000 |

16MB Files

| # | Provider | Score | Download | Throughput | Upload | Status |
|---|---|---|---|---|---|---|
| 1 | AWS S3 | 97.3 | 0.19s | 718.0 Mbps | 0.54s | 1000/1000 |
| 2 | Cloudflare R2 | 92.5 | 0.61s | 219.8 Mbps | 1.52s | 1000/1000 |
| 3 | Tigris | 92.2 | 0.70s | 192.1 Mbps | 1.37s | 999/1000 |

View full run · SVGs available as build artifacts
