A high-performance heterogeneous inference engine for Large Language Models (LLMs) with CPU-GPU co-execution, featuring PagedAttention memory management and Continuous Batching scheduling.
Hetero-Paged-Infer is a Rust-based inference system designed for efficient LLM serving. It combines cutting-edge techniques from the vLLM project with a modular, production-ready architecture:
- PagedAttention - Virtual memory-inspired KV Cache management eliminates memory waste
- Continuous Batching - Dynamic request scheduling maximizes GPU utilization
- Heterogeneous Computing - CPU orchestrates while GPU computes
- Production-Grade - Comprehensive error handling, metrics, and logging
| Feature | Description | Status |
|---|---|---|
| PagedAttention KV Cache | Block-based memory management with O(1) lookup | ✅ Ready |
| Continuous Batching | Prefill/decode phase management with decode priority | ✅ Ready |
| Memory Pressure Awareness | Configurable thresholds prevent OOM | ✅ Ready |
| Modular Architecture | Trait-based abstractions for all components | ✅ Ready |
| CUDA Graph Support | Decode-phase graph capture | 🚧 Planned |
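To make the block-based mapping behind the O(1) lookup concrete, here is a minimal page-table sketch in plain Rust. The `BlockPool` and `PageTable` names mirror the architecture diagram, but this is an illustration, not the repo's actual API:

```rust
// Illustrative PagedAttention-style block mapping; not this repo's actual API.
struct BlockPool {
    free: Vec<usize>, // ids of currently free physical blocks
}

impl BlockPool {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }
    fn allocate(&mut self) -> Option<usize> {
        self.free.pop() // O(1) allocation off the free list
    }
    fn release(&mut self, id: usize) {
        self.free.push(id);
    }
}

// Per-sequence page table: logical block index -> physical block id.
struct PageTable {
    blocks: Vec<usize>,
}

fn main() {
    let mut pool = BlockPool::new(4);
    let mut table = PageTable { blocks: Vec::new() };

    // A 40-token sequence with block_size = 16 needs 3 physical blocks.
    for _ in 0..3 {
        table.blocks.push(pool.allocate().expect("pool exhausted"));
    }
    assert_eq!(table.blocks, vec![0, 1, 2]); // O(1) lookup by indexing
    assert_eq!(pool.free.len(), 1);

    // Freeing the sequence returns its blocks to the pool: no fragmentation.
    for id in table.blocks.drain(..) {
        pool.release(id);
    }
    assert_eq!(pool.free.len(), 4);
}
```

Because every physical block is the same size, a sequence never reserves more than one partially filled block, which is where the memory savings come from.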
```text
┌─────────────────────────────────────────────────────────────────────────┐
│ InferenceEngine │
├─────────────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ Tokenizer │ │ Scheduler │ │ KV Cache Manager │ │
│ │ (CPU) │ │ (CPU) │ │ (CPU) │ │
│ │ Encode/Decode│ │ Prefill/ │ │ BlockPool/PageTable │ │
│ │ │ │ Decode │ │ │ │
│ └──────┬───────┘ └──────┬───────┘ └───────────┬──────────────┘ │
│ │ │ │ │
│ │ ┌──────▼───────┐ │ │
│ │ │ Batch Builder│◄──────────────┘ │
│ │ │ (CPU) │ │
│ │ └──────┬───────┘ │
│ ───────┼───────────────────┼──────────────────────────────────────── │
│ │ ┌──────▼───────┐ │
│ │ │ GPU Executor │ │
│ │ │ (CUDA/GPU) │ │
│ ───────┼────────────┴──────────────┴──────────────────────────────── │
│ │ ┌──────▼───────┐ │
│ └───────────►│ KV Cache │ │
│ │ (GPU Memory) │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
```
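The Scheduler box above owns the prefill/decode split. A minimal sketch of continuous batching with decode priority follows; the struct layout and names are illustrative assumptions, not the repo's actual `Scheduler`:

```rust
use std::collections::VecDeque;

// Illustrative continuous-batching scheduler with decode priority;
// the real Scheduler in this repo may be structured differently.
enum Phase { Prefill, Decode }

struct Request { id: u64, phase: Phase }

struct Scheduler {
    waiting: VecDeque<Request>, // new requests awaiting prefill
    running: Vec<Request>,      // requests already in the decode phase
    max_batch_size: usize,
}

impl Scheduler {
    // Decode-phase sequences are scheduled first; remaining batch slots
    // admit waiting requests for prefill.
    fn next_batch(&mut self) -> Vec<u64> {
        let mut batch: Vec<u64> = self.running.iter().map(|r| r.id).collect();
        while batch.len() < self.max_batch_size {
            let Some(mut req) = self.waiting.pop_front() else { break };
            batch.push(req.id);
            req.phase = Phase::Decode; // it decodes from the next step on
            self.running.push(req);
        }
        batch
    }
}

fn main() {
    let mut sched = Scheduler {
        waiting: VecDeque::from(vec![
            Request { id: 2, phase: Phase::Prefill },
            Request { id: 3, phase: Phase::Prefill },
        ]),
        running: vec![Request { id: 1, phase: Phase::Decode }],
        max_batch_size: 2,
    };
    // One running decoder plus one admitted prefill fills the batch of 2.
    assert_eq!(sched.next_batch(), vec![1, 2]);
    assert_eq!(sched.waiting.len(), 1); // request 3 still waits
}
```

Prioritizing decode keeps already-admitted sequences streaming tokens; prefill work is admitted only when spare batch capacity exists.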
- Rust 1.70+ (2021 edition)
- Linux environment (Ubuntu 20.04+ recommended)
- NVIDIA GPU with CUDA 11.x+ (optional, for GPU acceleration)
```bash
# Clone repository
git clone https://github.com/LessUp/hetero-paged-infer.git
cd hetero-paged-infer

# Build release version
cargo build --release

# Run tests
cargo test
```

```bash
# Simple inference
./target/release/hetero-infer --input "Hello, world!" --max-tokens 50
```
```bash
# With custom parameters
./target/release/hetero-infer \
    --input "Explain quantum computing" \
    --max-tokens 200 \
    --temperature 0.8 \
    --top-p 0.95
```

```rust
use hetero_infer::{EngineConfig, GenerationParams, InferenceEngine};

// Create engine
let config = EngineConfig::default();
let mut engine = InferenceEngine::new(config)?;

// Submit request
let params = GenerationParams {
    max_tokens: 100,
    temperature: 0.8,
    top_p: 0.95,
};
let request_id = engine.submit_request("Hello, world!", params)?;

// Run inference
let completed = engine.run();

// Get results
for result in completed {
    println!("Output: {}", result.output_text);
}
```

| Option | Default | Description |
|---|---|---|
| `--block-size` | 16 | Tokens per physical block |
| `--max-num-blocks` | 1024 | Total physical blocks |
| `--max-batch-size` | 32 | Max sequences per batch |
| `--memory-threshold` | 0.9 | Memory pressure threshold |
| `--temperature` | 1.0 | Sampling temperature |
| `--top-p` | 0.9 | Nucleus sampling threshold |
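With the default values above, the KV-cache capacity and the memory-pressure trigger work out as follows. This is a worked example, not code from the repo, and `used_blocks` is a made-up figure:

```rust
fn main() {
    // Defaults from the options table.
    let block_size = 16usize;
    let max_num_blocks = 1024usize;
    let memory_threshold = 0.9_f64;

    // Total KV-cache capacity in tokens: 16 * 1024.
    let capacity_tokens = block_size * max_num_blocks;
    assert_eq!(capacity_tokens, 16_384);

    // Memory pressure is the fraction of physical blocks in use;
    // `used_blocks` here is an illustrative figure.
    let used_blocks = 940usize;
    let pressure = used_blocks as f64 / max_num_blocks as f64;
    assert!(pressure >= memory_threshold); // 940/1024 ≈ 0.918: over threshold
}
```

Once pressure crosses the threshold, a scheduler can stop admitting new prefill requests until blocks are freed, which is how the configurable threshold prevents OOM.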
Create `config.json`:

```json
{
  "block_size": 16,
  "max_num_blocks": 1024,
  "max_batch_size": 32,
  "memory_threshold": 0.9
}
```

Load with: `./hetero-infer --config config.json`
| Resource | Description | Link |
|---|---|---|
| Architecture | System design and components | docs/en/ARCHITECTURE.md |
| API Reference | Rust API documentation | docs/en/API.md |
| Configuration | All configuration options | docs/en/CONFIGURATION.md |
| Deployment | Production deployment guide | docs/en/DEPLOYMENT.md |
| GitHub Pages | Online documentation | https://lessup.github.io/hetero-paged-infer/ |
Completed:
- PagedAttention KV Cache management
- Continuous Batching scheduler
- Memory pressure awareness
- Modular trait abstractions
- Comprehensive property testing
- Mock GPU executor for testing

Planned:
- Real CUDA kernel implementation
- Pinned memory management
- Copy-on-write KV sharing
- Async CPU/GPU overlap
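The mock GPU executor works because the engine talks to the backend through a trait. A hypothetical sketch of that boundary; `Executor` and `MockGpuExecutor` are illustrative names, not this repo's actual API:

```rust
// Hypothetical trait boundary that lets a mock stand in for the CUDA
// backend in tests; names are illustrative, not this repo's actual API.
trait Executor {
    // One forward step over a batch of token ids; returns one score per token.
    fn forward(&mut self, batch: &[u32]) -> Vec<f32>;
}

struct MockGpuExecutor;

impl Executor for MockGpuExecutor {
    fn forward(&mut self, batch: &[u32]) -> Vec<f32> {
        // Deterministic stand-in: echo each token id as a float score.
        batch.iter().map(|&t| t as f32).collect()
    }
}

fn main() {
    // Engine code can hold a `Box<dyn Executor>` and never know which backend runs.
    let mut exec: Box<dyn Executor> = Box::new(MockGpuExecutor);
    let scores = exec.forward(&[1, 2, 3]);
    assert_eq!(scores, vec![1.0, 2.0, 3.0]);
}
```

A real CUDA implementation would provide another `impl Executor`, so scheduler and cache logic stay testable on machines without a GPU.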
Memory efficiency comparison:
| Approach | Memory Waste | Throughput |
|---|---|---|
| Static Allocation | ~40-60% | Baseline |
| Dynamic Allocation | ~20-30% | +20% |
| PagedAttention | <5% | +50% |
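The PagedAttention figure follows from waste being bounded by the unfilled tail of a sequence's last block. A quick check with assumed sizes (the sequence and reservation lengths are illustrative, not measurements from this repo):

```rust
fn main() {
    // Assumed sizes for illustration: a 1000-token sequence under a
    // 2048-token static reservation, versus 16-token paged blocks.
    let seq_len = 1000usize;
    let max_seq_len = 2048usize;
    let block_size = 16usize;

    // Static allocation reserves max_seq_len slots up front.
    let static_waste = (max_seq_len - seq_len) as f64 / max_seq_len as f64;

    // Paged allocation wastes at most the tail of the last block.
    let blocks = (seq_len + block_size - 1) / block_size; // ceil division = 63
    let paged_waste = (blocks * block_size - seq_len) as f64
        / (blocks * block_size) as f64;

    assert!(static_waste > 0.5);  // ~51% of the reservation unused
    assert!(paged_waste < 0.01);  // 8 of 1008 slots unused, ~0.8%
}
```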
```bash
# Run all tests
cargo test

# Run with coverage
cargo tarpaulin --out Html

# Run property tests
cargo test -- --test-threads=1

# Run benchmarks
cargo bench
```

Test coverage:
| Type | Count | Coverage |
|---|---|---|
| Unit Tests | 78 | Core modules |
| Property Tests | 15 | Invariant verification |
| Integration Tests | 13 | End-to-end flows |
| Doc Tests | 29 | API examples |
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Quick steps:
```bash
# Fork and clone
git clone https://github.com/YOUR_USERNAME/hetero-paged-infer.git

# Create branch
git checkout -b feature/your-feature

# Make changes and test
cargo test
cargo fmt --check
cargo clippy

# Submit PR
git push origin feature/your-feature
```

This project is licensed under the MIT License.
- vLLM - PagedAttention concept
- Rust - Systems programming language
- NVIDIA CUDA - GPU computing platform
Made with ❤️ by the Hetero-Paged-Infer Team