Objective
Create a benchmarking suite that systematically compares popular LLMs across llama.cpp and OpenVINO GenAI, spanning multiple runtimes and hardware backends (CPU/GPU/NPU). The goal is to establish reproducible benchmarks that capture performance profiles and hardware utilization, helping the community understand model and backend strengths.
Summary of Proposed Benchmark
- Models: ~15 leading LLMs, including Llama, Qwen, DeepSeek, Phi, Gemma, etc.
- Frameworks:
  - OpenVINO GenAI (IR & GGUF)
  - llama.cpp (GGUF, Q4_0)
- Benchmarks run via OpenVINO GenAI llm_bench and llama.cpp llama-bench, covering these backends: llama.cpp default CPU, llama.cpp Vulkan, and OpenVINO CPU/GPU/NPU (see the driver sketch after this list).
- Metrics: Load/compile times, prompt evaluation speed, time to first token (TTFT), token generation speed, memory use, and quantization configuration, plus hardware and software details.
- Output: Tabular benchmarking results, observations, and reproducibility instructions/scripts.
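To make the intended workflow concrete, here is a minimal Python sketch of how a driver could orchestrate both tools and collect raw results for the tables above. The model paths, device list, and `Result` layout are illustrative assumptions, not part of this proposal; llama-bench's `-m`/`-p`/`-n`/`-o` flags and llm_bench's `-m`/`-d`/`-n` options are documented, but output formats differ across versions, so parsing into the metric columns is left as a later normalization step.

```python
import json
import subprocess
from dataclasses import dataclass

# Hypothetical model matrix for illustration only; the actual ~15-model
# list and file layout are not fixed by this proposal.
MODELS = {
    "llama-3.1-8b": {
        "gguf": "models/llama-3.1-8b-instruct-q4_0.gguf",
        "ov_ir": "models/llama-3.1-8b-instruct-ov-int4",
    },
}
OV_DEVICES = ["CPU", "GPU", "NPU"]


@dataclass
class Result:
    model: str
    framework: str   # "llama.cpp" or "OpenVINO GenAI"
    backend: str     # e.g. "CPU", "Vulkan", "GPU", "NPU"
    raw: object      # tool output, normalized later into TTFT/tok/s/memory columns


def run_llama_bench(gguf: str, n_prompt: int = 512, n_gen: int = 128) -> list:
    """Invoke llama.cpp's llama-bench; -m/-p/-n/-o are real llama-bench
    flags, but JSON field names can differ between llama.cpp versions."""
    proc = subprocess.run(
        ["llama-bench", "-m", gguf, "-p", str(n_prompt), "-n", str(n_gen),
         "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


def run_llm_bench(model_dir: str, device: str, iters: int = 3) -> str:
    """Invoke OpenVINO GenAI's llm_bench (benchmark.py). -m/-d/-n are
    documented llm_bench options; report-file flags vary by release, so
    stdout is captured raw here."""
    proc = subprocess.run(
        ["python", "benchmark.py", "-m", model_dir, "-d", device,
         "-n", str(iters)],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout


def main() -> None:
    results = []
    for name, paths in MODELS.items():
        # llama.cpp default CPU build (the Vulkan backend would use a
        # Vulkan-enabled build of llama-bench).
        results.append(Result(name, "llama.cpp", "CPU",
                              run_llama_bench(paths["gguf"])))
        # OpenVINO GenAI on each target device.
        for device in OV_DEVICES:
            results.append(Result(name, "OpenVINO GenAI", device,
                                  run_llm_bench(paths["ov_ir"], device)))
    # Downstream steps would normalize `raw` into the tabular metrics above.
    for r in results:
        print(r.model, r.framework, r.backend)


if __name__ == "__main__":
    main()
```

A reproducibility script along these lines, plus pinned tool versions, would let others rerun the exact matrix and append their own hardware results.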
Intent
- Guide users toward model/framework/device best practices
- Expose gaps and optimization opportunities
- Build a resource others can use to contribute and compare performance results