MineDraft accelerates large language model inference by overlapping the drafting and verification stages of speculative decoding, hiding latency and unlocking substantial throughput gains in batch settings.
| Metric | Improvement over Standard SD |
|---|---|
| Throughput | up to +75% |
| End-to-end latency | up to −39% |
Speculative decoding (SD) uses a small draft model to propose candidate tokens that a larger target model then verifies, reducing the number of expensive target forward passes. MineDraft lifts this paradigm to parallel execution by overlapping the drafting and verification stages, so that drafting latency is effectively hidden behind verification compute.
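The propose-verify loop at the heart of SD can be sketched as follows. This is a toy illustration, not MineDraft's API: `draft_propose` and `target_greedy` are hypothetical stand-ins, and a real implementation scores all k draft positions in a single batched target forward pass rather than one call per token.

```python
def speculative_step(prefix, draft_propose, target_greedy, k=5):
    """Propose k draft tokens, then accept the longest prefix the target agrees with."""
    proposal = draft_propose(prefix, k)       # k cheap draft tokens
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_greedy(ctx)         # target's verdict for this position
        if tok != expected:
            accepted.append(expected)         # replace the first mismatch with the target's token
            break
        accepted.append(tok)                  # draft token confirmed
        ctx.append(tok)
    else:
        accepted.append(target_greedy(ctx))   # bonus token when all k drafts are accepted
    return accepted
```

Each step therefore emits between 1 and k+1 tokens per target pass, which is where the speedup comes from; MineDraft additionally runs the next round of drafting concurrently with verification.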
Experiments across Qwen3, Llama-3.3, and EAGLE models validate MineDraft's gains on ShareGPT, LMSYS Arena, and Spec-Bench benchmarks.
- Requirements
- Installation
- Dataset Preparation
- Configuration
- Running Experiments
- Results & Analysis
- Troubleshooting
- Citation
| Component | Requirement |
|---|---|
| OS | Linux (tested: Ubuntu 22.04) |
| Python | 3.9–3.12 (tested: 3.12) |
| CUDA | ≥ 11.8 (tested: 12.8) |
| GPUs | 5× NVIDIA with sufficient VRAM (A100 80GB / H100 / L40 recommended) |
| Package | Version |
|---|---|
| vLLM | 0.9.2 |
| PyTorch | 2.7.0 |
| torch-scatter | 2.1.2 |
Step 1: Create a virtual environment (pick one):

**venv**

```bash
python -m venv venv
source venv/bin/activate
```

**uv**

```bash
uv venv --python 3.12 --seed
source venv/bin/activate
```

**conda**

```bash
conda create -n minedraft python=3.12 -y
conda activate minedraft
```

Step 2: Install vLLM:

```bash
pip install vllm==0.9.2 --extra-index-url https://download.pytorch.org/whl/cu128
```

Step 3: Install MineDraft:

```bash
pip install -e ".[benchmark]"
```

This installs:

- Core: `torch-scatter==2.1.2`
- Benchmark: `datasets`, `nvitop`, `pandas`, `numpy`, `matplotlib`, `IPython`, `tqdm`
```bash
mkdir -p benchmarks/datasets
python scripts/convert_datasets.py
```

| Output File | Source |
|---|---|
| `ShareGPT.json` | ShareGPT_V3_unfiltered_cleaned_split |
| `arena.json` | LMSYS Chatbot Arena Conversations |
| `spec_bench.json` | Spec-Bench |
| `tough.json` | Domain-specific tough questions |
Experiments use various speculative decoding configurations set via `--speculative-config`:
| Mode | GPUs | Layout |
|---|---|---|
| Parallel | 5 | 4 for target model TP + 1 for draft model |
| Sequential | 4 | All 4 for target model TP; drafter shares resources |
| Script | Model Setup |
|---|---|
| `experiment_1_*.sh` | Qwen3-32B with draft models (0.6B, 1.7B, 4B) |
| `experiment_2_eagle_*.sh` | EAGLE (Vicuna-33B, Vicuna-13B) |
| `experiment_2_llama_*.sh` | Llama-3.3-70B-AWQ with Llama-3.1-8B |
| `experiment_3_n_*.sh` | Multi-sample ablation |
| `experiment_4_bs_*.sh` | Batch size ablation (8, 16, 32, 64) |
| `experiment_5_tetris_*.sh` | Tetris VSR analysis |
| `experiment_6_qwen8b.sh` | Qwen3-32B with Qwen3-8B |
| `experiment_7_qwen235b.sh` | Qwen3-235B-A22B-FP8 with Qwen3-14B |
| `experiment_8_nsys.sh` | NVIDIA Nsight Systems profiling |
Each experiment ships with two variants: `*_parallel.sh` (5 GPUs) and `*_sequential.sh` (4 GPUs).
Run all experiments:

```bash
cd scripts
bash run_all.sh         # parallel + sequential
bash run_parallel.sh    # parallel only
bash run_sequential.sh  # sequential only
```

Run a single experiment:

```bash
cd scripts
bash experiment_1_parallel.sh          # Qwen3-32B parallel
bash experiment_2_eagle_sequential.sh  # EAGLE sequential
```

Useful on shared clusters, the bootstrap launcher waits for GPUs to become free before launching:

```bash
# First, comment out the `export CUDA_VISIBLE_DEVICES=` line in the target script, then:
python scripts/bootstrap.py bash scripts/experiment_1_parallel.sh
```

The bootstrap script monitors GPU availability, waits until 5 GPUs are free (< 1% memory and utilization), then sets `CUDA_VISIBLE_DEVICES` and launches the command. You can adjust the required GPU count and thresholds in its `main` function.
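The wait-and-launch logic can be approximated as below. This is a minimal sketch assuming `nvidia-smi` CSV queries are available; the actual `scripts/bootstrap.py` may differ in details.

```python
import os
import subprocess
import sys
import time

# nvidia-smi CSV query: index, used memory (MiB), total memory (MiB), utilization (%)
QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"]

def parse_idle_gpus(csv_out, mem_pct=1.0, util_pct=1.0):
    """Return indices of GPUs under both the memory-% and utilization-% thresholds."""
    idle = []
    for line in csv_out.strip().splitlines():
        idx, used, total, util = (float(x) for x in line.split(","))
        if 100.0 * used / total < mem_pct and util < util_pct:
            idle.append(int(idx))
    return idle

def wait_and_launch(cmd, n_gpus=5, poll_s=30):
    """Poll until n_gpus idle GPUs exist, pin them via CUDA_VISIBLE_DEVICES, then run cmd."""
    while True:
        idle = parse_idle_gpus(subprocess.check_output(QUERY, text=True))
        if len(idle) >= n_gpus:
            env = dict(os.environ,
                       CUDA_VISIBLE_DEVICES=",".join(map(str, idle[:n_gpus])))
            sys.exit(subprocess.call(cmd, env=env))
        time.sleep(poll_s)
```

Pinning `CUDA_VISIBLE_DEVICES` in the environment (rather than in the script) is what makes commenting out the `export` line in the experiment script necessary.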
| Artifact | Location |
|---|---|
| Benchmark traces | `benchmarks/trace/*.jsonl` |
| Nsight Systems profiling reports | `*.nsys-rep` (project root) |
| Trace analysis notebook | `benchmarks/trace/analyze_plots.ipynb` |
| Trace analysis utilities | `benchmarks/trace/analyze_traces.py` |
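Traces are JSONL (one JSON object per line), so they are easy to load for ad-hoc analysis outside the provided notebook. A generic sketch (the field names in real traces depend on the trace schema, which is not shown here):

```python
import json

import pandas as pd  # pandas is installed via the [benchmark] extra

def load_trace(path):
    """Read a JSONL benchmark trace into a pandas DataFrame, one row per record."""
    with open(path) as f:
        return pd.DataFrame([json.loads(line) for line in f if line.strip()])
```

From there, per-request latency or acceptance statistics are a `groupby` away.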
**Out of Memory (OOM)**

- Reduce `--gpu-memory-utilization` (default: `0.65`)
- Reduce `--max-num-seqs` (batch size)
- Switch to a smaller draft or target model

**CUDA Version Mismatch**

Verify your CUDA installation:

```bash
nvcc --version
nvidia-smi
```

MineDraft requires CUDA ≥ 12.8 for the tested configuration.
**Model Download Issues**

Models are automatically downloaded from HuggingFace. Ensure you have:

- Sufficient disk space or quota
- HuggingFace access tokens for gated models (e.g., Llama)

For downloading gated models, run:

```bash
huggingface-cli login
```

**NVIDIA Nsight Systems: "Wrong event order" error**

If you see:

```
Wrong event order has been detected when adding events to the collection
```

upgrade to Nsight Systems ≥ 2024.2 from the NVIDIA developer portal.
If you find MineDraft useful in your research, please cite:
```bibtex
@article{tang2026minedraft,
  title   = {MineDraft: A Framework for Batch Parallel Speculative Decoding},
  author  = {Tang, Zhenwei and Verma, Arun and Zhou, Zijian and Wu, Zhaoxuan
             and Prakash, Alok and Rus, Daniela and Low, Bryan Kian Hsiang},
  journal = {arXiv preprint arXiv:2603.18016},
  year    = {2026}
}
```
```jsonc
{
  "method": null,                   // null = standard SD | "eagle" = EAGLE
  "model": "<draft_model>",         // HuggingFace model ID for the draft model
  "draft_tensor_parallel_size": 1,  // TP size for the draft model (always 1)
  "num_speculative_tokens": 5,      // Number of draft tokens (k)
  "is_parallel": true,              // Enable PSD (and MineDraft)
  "force_pearl": false,             // Enable PEARL if is_parallel is true (disables MineDraft)
  "tetris": true,                   // Enable Tetris
  "tetris_turn_on_batch_size": 1,   // Batch size threshold to activate Tetris
  "tetris_capacity": 0,             // Tetris capacity; 0 = auto-calculated from k × max_num_seqs
  "tetris_extra_proposals": 3       // Extra draft tokens for Tetris
}
```
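Since the config is passed as a JSON string on the command line, it can be built programmatically and quoted for the shell. A small sketch (the draft model name below is an arbitrary placeholder, not a tested pairing, and only a subset of the keys is shown):

```python
import json
import shlex

# Build a speculative config as a plain dict, then serialize it for the CLI flag.
spec = {
    "method": None,               # standard SD (use "eagle" for EAGLE)
    "model": "Qwen/Qwen3-0.6B",   # placeholder draft model, for illustration only
    "draft_tensor_parallel_size": 1,
    "num_speculative_tokens": 5,
    "is_parallel": True,          # MineDraft parallel mode
    "tetris": True,
}
flag = "--speculative-config " + shlex.quote(json.dumps(spec))
```

`shlex.quote` ensures the embedded JSON survives shell word-splitting when the flag is spliced into an experiment script.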