LLM benchmarking framework with SystemDS & Ollama & vLLM backends - LDE Project #2431
Open
kubraaksux wants to merge 65 commits into apache:main from
Conversation
Generic LLM benchmark suite for evaluating inference performance across different backends (vLLM, Ollama, OpenAI, MLX). Features: - Multiple workload categories: math (GSM8K), reasoning (BoolQ, LogiQA), summarization (XSum, CNN/DM), JSON extraction - Pluggable backend architecture for different inference engines - Performance metrics: latency, throughput, memory usage - Accuracy evaluation per workload type - HTML report generation This framework can be used to evaluate SystemDS LLM inference components once they are developed.
- Connection.java: Changed loadModel(modelName) to loadModel(modelName, workerScriptPath) - Connection.java: Removed findPythonScript() method - LLMCallback.java: Added Javadoc for generate() method - JMLCLLMInferenceTest.java: Updated to pass script path to loadModel()
- Connection.java: Auto-find available ports for Py4J communication - Connection.java: Add loadModel() overload for manual port override - Connection.java: Use destroyForcibly() with waitFor() for clean shutdown - llm_worker.py: Accept python_port as command line argument
Move worker script from src/main/python/systemds/ to src/main/python/ to avoid shadowing Python stdlib operator module.
- Add generateWithTokenCount() returning JSON with input/output token counts - Update generateBatchWithMetrics() to include input_tokens and output_tokens columns - Add CUDA auto-detection with device_map=auto for multi-GPU support in llm_worker.py - Check Python process liveness during startup instead of blind 60s timeout
- Fix duplicate accuracy computation in runner.py - Add --model flag and error handling to run_all_benchmarks.sh - Fix ttft_stats and timing_stats logic bugs - Extract shared helpers into scripts/utils.py - Add HuggingFace download fallback to all loaders - Fix reasoning accuracy false positives with word-boundary regex - Pin dependency versions in requirements.txt - Clean up dead code and unify config keys across backends - Fix README clone URL and repo structure
- Use real token counts from Ollama/vLLM APIs, omit when unavailable - Correct TTFT and cost estimates - Add --gpu-hour-cost and --gpu-count flags for server benchmarks
- 121 unit tests for all accuracy checkers, loaders, and metrics - ROUGE-1/2/L scoring for summarization (replaces quality-gate heuristic) - Concurrent request benchmarking with --concurrency flag - GPU profiling via pynvml - Real TTFT for MLX backend via stream_generate - Backend factory pattern and config validation - Proper logging across all components - Updated configs to n_samples=50
Replace declare -A (bash 4+ only) with a case function for default model lookup. macOS ships with bash 3.x.
- New embeddings workload using STS-Benchmark from HuggingFace - Model rates semantic similarity between sentence pairs (0-5 scale) - 21 new tests for score extraction, accuracy check, sample loading - Total: 142 tests passing across 5 workloads
- Add electricity + hardware amortization cost estimation to runner (--power-draw-w, --electricity-rate, --hardware-cost flags) - Fix aggregate.py cost key mismatch (api_cost_usd vs cost_total_usd) - Add compute cost columns to CSV output and HTML report - Update README with cost model documentation and embeddings workload
Include all 10 benchmark runs (5 OpenAI + 5 Ollama, 50 samples each) with metrics, samples, configs, HTML report, and aggregated CSV.
- 5 workloads x 2 models on NVIDIA H100 PCIe via vLLM - Mistral-7B-Instruct-v0.3: strong reasoning (68%), fast embeddings (129ms) - Qwen2.5-3B-Instruct: best embeddings accuracy (90%), 75ms latency - Compute costs reflect H100 electricity (350W) + hardware amortization - Regenerated summary.csv and benchmark_report.html with all 20 runs
Integrate SystemDS as a benchmark backend using the JMLC API. All prompts are processed through PreparedScript.generateBatchWithMetrics() which returns results in a typed FrameBlock with per-prompt timing and token metrics. Benchmark results for 4 workloads with distilgpt2 on H100.
Run the embeddings (semantic similarity) workload with SystemDS JMLC, bringing SystemDS to 5 workloads matching all other backends.
Run all 5 workloads with Qwen/Qwen2.5-3B-Instruct through the SystemDS JMLC backend, replacing the distilgpt2 toy model. This enables a direct apples-to-apples comparison with vLLM Qwen 3B: same model, different serving path (raw HuggingFace via JMLC vs optimized vLLM inference).
Replace distilgpt2 toy model with same models used by vLLM backends: - SystemDS + Qwen 3B (5 workloads) vs vLLM + Qwen 3B - SystemDS + Mistral 7B (5 workloads) vs vLLM + Mistral 7B All runs include compute cost flags (350W, $0.30/kWh, $30k hardware). Increase JMLC worker timeout from 60s to 300s for larger models.
7B+ models need more time to load weights into GPU memory.
This file was accidentally modified in a prior commit. Restoring the original vectorized SIMD implementation.
Force-pushed from 18cecdd to 0f6f4af
vLLM results with 4 concurrent requests showing 5-8x throughput improvement and 80-88% per-query cost reduction compared to sequential processing. Also fix crash when model outputs non-dict JSON in json_extraction evaluator.
Force-pushed from 0f6f4af to 85bfa93
Replace sequential per-prompt inference with true GPU batching: - LLMCallback.java: add generateBatch() for batched inference - PreparedScript.java: call generateBatch() instead of per-prompt loop - llm_worker.py: implement batched tokenization and model.generate() Results (50 samples per workload, NVIDIA H100): - Qwen 3B: 3-12x speedup (math 22s->1.9s, embeddings 144ms->49ms) - Mistral 7B: 7-14x speedup (json 5.4s->388ms, embeddings 380ms->28ms) - Batched SystemDS now faster than sequential vLLM on most workloads - Accuracy comparable (within statistical noise, n=50)
- LLMCallback.java: add generateBatch() interface method - PreparedScript.java: replace per-prompt for-loop with single batch call - llm_worker.py: implement batched tokenization and model.generate() Achieves 3-14x speedup over sequential inference on H100.
PreparedScript.generateBatchWithMetrics() now accepts a boolean batched parameter: true for GPU-batched inference (new), false for the original sequential for-loop. Defaults to batched=true. systemds_backend.py passes the batched flag from config so benchmark runs can select either mode.
generateBatchWithMetrics() now accepts a boolean batched parameter: true for GPU-batched (new), false for original sequential for-loop.
# Conflicts: # .gitignore # src/test/java/org/apache/sysds/test/functions/jmlc/JMLCLLMInferenceTest.java
- Use proper imports instead of inline fully-qualified class names - Add try-with-resources for HTTP streams to prevent resource leaks - Add connect/read timeouts to HTTP calls - Add lineage tracing support for llmPredict - Add checkInvalidParameters validation in parser - Remove leftover Py4J code from Connection/PreparedScript - Delete LLMCallback.java - Remove .claude/.env/meeting_notes from .gitignore - Trim verbose docstrings
- Use proper imports instead of inline fully-qualified class names - Add try-with-resources for HTTP streams to prevent resource leaks - Add connect/read timeouts to HTTP calls - Add lineage tracing support for llmPredict - Add checkInvalidParameters validation in parser - Remove .claude/.env/meeting_notes from .gitignore - Trim verbose docstrings
Supports parallel HTTP calls to the inference server via ExecutorService. Default concurrency=1 keeps sequential behavior.
# Conflicts: # src/main/java/org/apache/sysds/parser/ParameterizedBuiltinFunctionExpression.java # src/main/java/org/apache/sysds/runtime/instructions/cp/ParameterizedBuiltinCPInstruction.java
- Delete Py4J-based benchmark results (will re-run with llmPredict) - Remove license header from test (Matthias will add) - Clarify llm_server.py docstring
JMLC requires the LHS variable name in read() assignments to match the input name registered in prepareScript(). Changed X/R to prompts/results so RewriteRemovePersistentReadWrite correctly converts persistent reads to transient reads.
Benchmarking framework that compares LLM inference across four backends: OpenAI API, Ollama, vLLM, and a new SystemDS JMLC backend. Evaluated on 5 workloads (math, reasoning, summarization, JSON extraction, embeddings) with 55 total benchmark runs on NVIDIA H100.
Purpose and motivation
This project was developed as part of the LDE (Large-Scale Data Engineering) course. The goal is to evaluate how SystemDS — a system designed for large-scale data processing — can be extended to support LLM inference, and how its performance compares to established LLM serving solutions.
Research questions:
Approach:
Connection.java for model lifecycle, PreparedScript.java for batch inference via FrameBlock, llm_worker.py for HuggingFace model execution

Key findings (summary):
Project structure
JMLC API extension (also in #2430):
Backends
model.generate()

Workloads and datasets
- openai/gsm8k
- google/boolq
- EdinburghNLP/xsum
- mteb/stsbenchmark-sts
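For reference, these datasets can be pulled from the HuggingFace Hub with the datasets library. A minimal sketch; the repo's loaders add their own sampling and prompt formatting, and the split choices shown here are assumptions:

```python
# Sketch: pulling the workload datasets from the HuggingFace Hub.
# The repo's loaders wrap these calls with sampling and prompt formatting.
from datasets import load_dataset

gsm8k = load_dataset("openai/gsm8k", "main", split="test")    # math
boolq = load_dataset("google/boolq", split="validation")      # reasoning
xsum = load_dataset("EdinburghNLP/xsum", split="test")        # summarization (may need trust_remote_code=True)
stsb = load_dataset("mteb/stsbenchmark-sts", split="test")    # embeddings / semantic similarity

print(gsm8k[0]["question"])
```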
All workloads use temperature=0.0 (deterministic generation) to ensure reproducible results. Each run processes 50 samples.

ROUGE scoring (summarization workload):
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a standard metric for evaluating text summarization. It measures the overlap between a generated summary and a reference summary.
Example: if the reference is "The mayor announced a new park in downtown" and the model produces "A new park will be built in the downtown area", ROUGE-1 counts shared words like "new", "park", "downtown" and computes the F1 from precision and recall of those matches.
We use ROUGE-1 F1 ≥ 0.2 as the accuracy threshold: a prediction passes if it has meaningful overlap with the reference summary. This is standard in summarization evaluation (e.g., used in the CNN/DailyMail and XSum benchmarks).
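A minimal sketch of this check using the rouge-score package; the helper name and exact wiring are illustrative, and the repo's checker may differ in detail:

```python
# ROUGE-1 F1 accuracy check for the summarization workload (sketch).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def summary_is_correct(prediction: str, reference: str, threshold: float = 0.2) -> bool:
    # rouge_score expects (target, prediction) order
    scores = scorer.score(reference, prediction)
    return scores["rouge1"].fmeasure >= threshold

reference = "The mayor announced a new park in downtown"
prediction = "A new park will be built in the downtown area"
print(summary_is_correct(prediction, reference))  # True: shared words push ROUGE-1 F1 well above 0.2
```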
How measurements work
The runner (runner.py) takes a backend, workload config, and output directory:

Per-run outputs:
- samples.jsonl — per-sample predictions, references, latency, correctness
- metrics.json — aggregated latency stats (mean, p50, p95, cv), throughput, accuracy, cost
- run_config.json — full configuration for reproducibility
- manifest.json — file checksums

Metrics collected:

- Throughput: n / total_wall_clock_seconds (fair across sequential and concurrent modes); see the sketch below
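A minimal sketch of how these aggregates can be computed from per-sample latencies; the helper name is illustrative, the actual logic lives in the runner and its shared helpers:

```python
# Aggregate latency stats and throughput for one run (sketch).
import statistics

def aggregate_metrics(latencies_s, total_wall_clock_s):
    lat = sorted(latencies_s)
    n = len(lat)
    mean = statistics.mean(lat)
    return {
        "n": n,
        "latency_mean_s": mean,
        "latency_p50_s": lat[int(0.50 * (n - 1))],   # nearest-rank percentiles
        "latency_p95_s": lat[int(0.95 * (n - 1))],
        "latency_cv": statistics.stdev(lat) / mean if n > 1 else 0.0,
        # n / wall clock keeps sequential and concurrent runs comparable
        "throughput_rps": n / total_wall_clock_s,
    }
```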
Run distribution

55 total benchmark runs organized in 3 result directories across 3 experimental phases.
- results/ — 35 baseline runs (Phase 1, all sequential, concurrency=1)
- results_c4/ — 10 optimization runs (Phase 2, vLLM with concurrency=4)
- results_batch/ — 10 optimization runs (Phase 3, SystemDS with GPU batching)

Grand total: 35 baseline + 10 concurrent + 10 batched = 55 runs
All runs: 50 samples each, NVIDIA H100 PCIe (81GB), temperature=0.0.

Phase 1: Sequential baseline (concurrency=1)
35 runs on NVIDIA H100, 50 samples each. All backends process one prompt at a time.
Accuracy (% correct):
vLLM and SystemDS run the same models on the same GPU, so accuracy is comparable. Small differences (±4%) are within statistical noise for n=50.
Latency (p50, median per-prompt response time):
SystemDS is 2-5x slower than vLLM with the same model. vLLM uses an optimized serving engine (PagedAttention, CUDA kernels), while SystemDS calls standard HuggingFace model.generate() through a Py4J bridge. The gap reflects inference engine optimization, not just IPC overhead.

Cost per query (API + compute):
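The compute part of this cost comes from electricity plus hardware amortization, driven by the --power-draw-w, --electricity-rate, and --hardware-cost flags. A minimal sketch of that calculation, assuming a 3-year amortization window (the window and helper name are illustrative, not the exact runner.py code):

```python
# Sketch of the compute-cost model (electricity + hardware amortization per query).
# The 3-year amortization window and the function name are illustrative assumptions.

def compute_cost_per_query(latency_s: float,
                           power_draw_w: float = 350.0,      # --power-draw-w
                           electricity_rate: float = 0.30,   # --electricity-rate, $/kWh
                           hardware_cost: float = 30_000.0,  # --hardware-cost, $
                           amortization_years: float = 3.0) -> float:
    hours = latency_s / 3600.0
    electricity = (power_draw_w / 1000.0) * hours * electricity_rate
    amortization = hardware_cost / (amortization_years * 365 * 24) * hours
    return electricity + amortization

# Example: a 2-second query on an H100 at the flags used in these runs.
print(f"${compute_cost_per_query(2.0):.6f} per query")
```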
Phase 2: Concurrency experiment (vLLM, concurrency=4)
Identified that the sequential baseline represents worst-case throughput. Re-ran vLLM with 4 concurrent requests using ThreadPoolExecutor to measure parallel processing gains.
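A minimal sketch of the client-side concurrency driver; the endpoint, payload, and helper name are illustrative assumptions, the real vLLM backend wraps its own HTTP client:

```python
# Sketch: 4 concurrent requests against a vLLM OpenAI-compatible endpoint.
from concurrent.futures import ThreadPoolExecutor
import requests

VLLM_URL = "http://localhost:8000/v1/completions"  # assumed local vLLM server

def send_prompt(prompt: str) -> str:
    resp = requests.post(VLLM_URL, json={
        "model": "Qwen/Qwen2.5-3B-Instruct",
        "prompt": prompt,
        "temperature": 0.0,
        "max_tokens": 256,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

prompts = [f"Question {i}: ..." for i in range(50)]
with ThreadPoolExecutor(max_workers=4) as pool:    # --concurrency=4
    results = list(pool.map(send_prompt, prompts))
```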
Throughput improvement (requests/second):

vLLM scales 5-8x with concurrency=4 because its engine collects concurrent requests into GPU batches via continuous batching. This raised the question: can SystemDS achieve similar gains?
SystemDS could not use the same --concurrency=4 approach because the JMLC backend uses a single Py4J worker — concurrent requests would serialize on the same Python process. A different optimization strategy was needed.

Phase 3: GPU batching optimization for SystemDS
Root cause analysis:
The original PreparedScript.generateBatchWithMetrics() contained a Java for-loop that called the Python worker once per prompt. Each model.generate() call used the GPU for a single prompt, leaving most of the GPU's parallel compute capacity idle.

What we changed (3 files):
- LLMCallback.java: Added generateBatch(String[] prompts, ...) to the Py4J callback interface, accepting an array of prompts instead of a single string.
- PreparedScript.java: generateBatchWithMetrics() now accepts a boolean batched parameter:
  - batched=true (default): passes all prompts to Python in one Py4J call
  - batched=false: uses the original sequential for-loop (preserved for reproducibility)
- llm_worker.py: Added generateBatch() method that (see the sketch after this list):
  - tokenizes the whole batch with padding (tokenizer(batch, padding=True))
  - runs model.generate() on the padded batch
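A minimal Python sketch of the batched worker path; names and details are illustrative approximations of the llm_worker.py change, not the exact code:

```python
# Sketch of the batched generation path (illustrative, not the exact llm_worker.py code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"                  # left-pad so all sequences end at the same position
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token    # decoder-only models often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def generate_batch(prompts, max_new_tokens=256, sub_batch=8):
    outputs = []
    for i in range(0, len(prompts), sub_batch):  # sub-batches of 8, see the memory discussion below
        chunk = prompts[i:i + sub_batch]
        enc = tokenizer(chunk, padding=True, return_tensors="pt").to(model.device)
        with torch.no_grad():
            gen = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False)
        # keep only the newly generated tokens, drop the (padded) prompt prefix
        new_tokens = gen[:, enc["input_ids"].shape[1]:]
        outputs.extend(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
    return outputs
```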
Why sub-batch size 8? This is a GPU memory trade-off. Each prompt in a batch requires its own KV-cache allocation during generation. With padding to the longest prompt in the sub-batch, memory usage scales as batch_size × max_sequence_length × model_dimensions. For 7B models on an 81GB H100, sub-batch=8 keeps peak memory well within limits while still utilizing GPU parallelism effectively. Larger sub-batches (16, 32) would risk OOM on longer workloads like math (512 max tokens). Smaller sub-batches (2, 4) would leave GPU compute underutilized. 8 is a practical sweet spot — not tuned for a specific comparison, but chosen for reliable execution across all workloads and models.
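A back-of-the-envelope check of that scaling, under simplifying assumptions (fp16 KV cache, no grouped-query attention, a generic 32-layer / 4096-dim 7B shape, assumed prompt length):

```python
# Rough KV-cache estimate for one sub-batch (assumptions: fp16, no GQA,
# 32 layers, 4096 hidden size; real model configs differ somewhat).
layers, hidden, bytes_fp16 = 32, 4096, 2
kv_per_token = 2 * layers * hidden * bytes_fp16         # K and V: ~0.5 MB per token per sequence
seq_len = 512 + 512                                     # assumed prompt + generated tokens (math workload)
sub_batch = 8
kv_total_gb = sub_batch * seq_len * kv_per_token / 1e9  # ~4.3 GB
weights_gb = 7e9 * bytes_fp16 / 1e9                     # ~14 GB for 7B weights in fp16
print(f"KV cache ~{kv_total_gb:.1f} GB + weights ~{weights_gb:.0f} GB, well under 81 GB")
```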
Throughput comparison (requests/second, measured as n / total_wall_clock):

Total wall-clock time for 50 prompts (Qwen 3B):
Total wall-clock time for 50 prompts (Mistral 7B):
Accuracy (sequential vs batched SystemDS):
Mistral 7B accuracy is stable across modes. Qwen 3B shows some variation (up to ±10%), likely because padding changes the attention patterns for smaller models. The direction is inconsistent (some up, some down), suggesting statistical noise rather than systematic degradation. With n=50 and temperature=0.0, these differences are within the expected range.

Architecture and batching details
Why batched SystemDS beats sequential vLLM (2-4x):
Sequential vLLM processes one HTTP request at a time. Each request carries overhead: HTTP parsing, vLLM scheduler dispatch, memory allocation, response serialization. The GPU processes a single sequence, then waits for the next request.
Batched SystemDS skips all server overhead (direct Py4J call) and sends 8 prompts to the GPU at once. The GPU's tensor cores process multiple sequences in parallel — 8 prompts take roughly 2x the time of 1, not 8x. This GPU-level parallelism is the key advantage.
Why vLLM concurrent=4 still beats batched SystemDS (2-3x):
vLLM's serving engine is purpose-built for throughput:
SystemDS uses standard HuggingFace model.generate(), which lacks these optimizations. The remaining 2-3x gap reflects the difference between a general-purpose inference API and a specialized serving engine.

Could we close the remaining gap? Partially. Possible future improvements:
- Replace model.generate() with optimized backends (vLLM as a library, TensorRT-LLM, or Flash Attention)

These would bring SystemDS throughput closer to vLLM but would significantly increase complexity.
Conclusions
Accuracy: OpenAI (gpt-4.1-mini) leads on most tasks. Among local models, Mistral 7B excels at reasoning (74%) while Qwen 3B is stronger on math (72%) and embeddings (90%). vLLM and SystemDS produce comparable accuracy since they run the same models.
The sequential bottleneck was real: The original SystemDS JMLC path was 2-5x slower than vLLM because it processed each prompt in a separate GPU call through a Java for-loop and Py4J bridge.
GPU batching closes most of the gap: By tokenizing prompts together and running model.generate() on batches of 8, SystemDS achieved 3-12x speedup and now outperforms sequential vLLM by 2-4x.

vLLM's serving engine still wins for production: With concurrency=4, vLLM is 2-3x faster than batched SystemDS. The gap is due to PagedAttention, continuous batching, and custom CUDA kernels — optimizations that go beyond what HuggingFace's standard model.generate() provides.

Cost scales with throughput: Faster inference = less GPU time per query = lower cost. Batched SystemDS reduces per-query cost by 80-90% compared to sequential, making local inference cost-competitive with the OpenAI API.
The FrameBlock API provides a clean abstraction: Both sequential and batched modes return the same structured columnar output ([prompt, generated_text, time_ms, input_tokens, output_tokens]), controlled by a single batched boolean parameter. All results are preserved for reproducibility.

Reproducibility
Both inference modes are preserved in the code. To reproduce:
Sequential results were collected with the original for-loop code (git history preserves this). Batched results were collected after the GPU batching optimization. The batched parameter in PreparedScript.generateBatchWithMetrics() defaults to true but can be set to false to reproduce sequential behavior.
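For example, a run could select the mode through the backend config that systemds_backend.py reads; the key names below are illustrative, not necessarily the exact schema:

```python
# Illustrative run configs; actual key names in systemds_backend.py may differ.
batched_run = {
    "backend": "systemds",
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "n_samples": 50,
    "temperature": 0.0,
    "batched": True,   # GPU-batched inference (default)
}

sequential_run = dict(batched_run, batched=False)  # original per-prompt for-loop
```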