TurboQuant: KV Cache Compression for LLMs

A from-scratch PyTorch implementation of TurboQuant (arXiv:2504.19874) for KV-cache compression on consumer GPUs.

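A reproducible, Qwen-focused TurboQuant replication with paired baseline-vs-TurboQuant needle-in-a-haystack (NIAH) validation through 16K in the current release-check path. A separate paired matrix extends the Qwen check to 32K (see Current Validated Scope).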

Why This Matters

Long-context inference is often limited by KV-cache memory and memory bandwidth, not raw compute. If you can shrink KV-cache without breaking quality, you can run longer contexts cheaper and faster.

This project is a practical replication effort focused on that tradeoff.

Without KV compression:
  longer context -> much larger KV cache -> more VRAM pressure / memory traffic

With TurboQuant-style KV compression:
  longer context -> compressed KV cache -> lower memory footprint -> better practical scaling
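
To make the scaling above concrete, here is a rough back-of-the-envelope calculation of uncompressed KV-cache size. The model shape (28 layers, 4 KV heads, head dimension 128) is an assumption chosen to resemble a Qwen2.5-7B-class model with grouped-query attention; check the `config.json` of the model you actually run. The function name is illustrative and not part of this repo.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Uncompressed KV-cache size in bytes (keys + values, all layers)."""
    # Factor 2 accounts for storing both K and V.
    return 2 * batch * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed model shape (not read from this repo): 28 layers, 4 KV heads, head_dim 128.
ASSUMED_SHAPE = dict(num_layers=28, num_kv_heads=4, head_dim=128)

for seq_len in (4096, 16384, 32768):
    size = kv_cache_bytes(seq_len, **ASSUMED_SHAPE)
    print(f"{seq_len:>6} tokens -> {size / 2**30:.2f} GiB in fp16")
```

The cache grows linearly with context length, so under these assumptions every doubling of the context doubles the fp16 footprint.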

What This Project Does

Compresses KV cache online during generation using a TurboQuant-style rotation + Lloyd-Max pipeline.
Preserves completion quality with compact settings (6-bit keys/values) for generation-heavy workloads.
Provides a retrieval-safe profile for long-context NIAH-style tasks.
Includes reproducible long-context paired evaluation tooling (baseline vs TurboQuant on identical prompts).
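
For readers unfamiliar with the Lloyd-Max step mentioned above, the following is a minimal sketch of fitting a 1-D Lloyd-Max codebook to samples. It assumes the values to be quantized are approximately standard Gaussian, which is only an illustration; the precomputed codebooks shipped in codebooks/ may be built differently. All function names here are hypothetical.

```python
import bisect
import random


def lloyd_max_codebook(samples, bits, iters=50):
    """Fit a 1-D Lloyd-Max codebook with 2**bits levels to the samples."""
    data = sorted(samples)
    levels = 2 ** bits
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        # Decision boundaries are midpoints between neighbouring centroids.
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        for x in data:
            cell = bisect.bisect_right(bounds, x)
            sums[cell] += x
            counts[cell] += 1
        # Each centroid moves to the mean of its cell (unchanged if the cell is empty).
        centroids = [s / c if c else m for s, c, m in zip(sums, counts, centroids)]
    return centroids


def quantize(x, centroids):
    """Return the index of the codebook cell that x falls into."""
    bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    return bisect.bisect_right(bounds, x)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
codebook = lloyd_max_codebook(samples, bits=2)
print([round(c, 3) for c in codebook])
```

Because the codebook depends only on the assumed distribution and not on the data seen at inference time, it can be computed once offline and reused.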

TurboQuant In One Diagram

Input token stream
      |
      v
Model computes K,V per layer/head
      |
      +--> Baseline cache: store fp16 K,V directly
      |
      +--> TurboQuant cache:
             K: norm -> unit normalize -> random rotate -> Lloyd-Max quantize -> indices (+ optional QJL fields)
             V: per-group min/max quantize -> indices + scales
             keep recent window uncompressed (buffer)
      |
      v
Attention step dequantizes on demand for logits/value mix
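
The two cache paths in the diagram can be sketched roughly as follows. This is an illustrative approximation, not the code in core/: it uses NumPy instead of PyTorch to stay dependency-light, substitutes a uniform grid for the Lloyd-Max codebook, assumes a value group size of 32, and omits the optional QJL fields and the uncompressed recent-window buffer. All names are hypothetical.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix keeps the distribution uniform


def quantize_keys(k, rotation, codebook):
    """K path: store norm, normalise, rotate, then pick nearest codebook entry."""
    norms = np.linalg.norm(k, axis=-1, keepdims=True)
    rotated = (k / np.maximum(norms, 1e-12)) @ rotation.T
    idx = np.abs(rotated[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return idx, norms


def dequantize_keys(idx, norms, rotation, codebook):
    """Look up codebook values, undo the rotation, restore the norm."""
    return (codebook[idx] @ rotation) * norms


def quantize_values(v, bits, group_size):
    """V path: per-group min/max (asymmetric uniform) quantization."""
    groups = v.reshape(*v.shape[:-1], -1, group_size)
    lo = groups.min(axis=-1, keepdims=True)
    hi = groups.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / (2 ** bits - 1)
    idx = np.round((groups - lo) / scale).astype(np.uint8)
    return idx, scale, lo


def dequantize_values(idx, scale, lo, shape):
    return (idx * scale + lo).reshape(shape)


rng = np.random.default_rng(1)
dim, bits = 128, 6
keys = rng.standard_normal((4, dim))
values = rng.standard_normal((4, dim))

rotation = random_rotation(dim)
# Stand-in codebook: uniform grid over +/-3 standard deviations of a rotated
# unit-vector coordinate (std ~ 1/sqrt(dim)). A Lloyd-Max codebook would go here.
codebook = np.linspace(-3.0, 3.0, 2 ** bits) / np.sqrt(dim)

k_idx, k_norms = quantize_keys(keys, rotation, codebook)
keys_hat = dequantize_keys(k_idx, k_norms, rotation, codebook)
key_rel_err = np.linalg.norm(keys_hat - keys) / np.linalg.norm(keys)

v_idx, v_scale, v_lo = quantize_values(values, bits, group_size=32)
values_hat = dequantize_values(v_idx, v_scale, v_lo, values.shape)
value_max_err = np.abs(values_hat - values).max()

print(f"key relative error: {key_rel_err:.4f}, value max abs error: {value_max_err:.4f}")
```

The rotation spreads each key's energy evenly across coordinates, which is what lets a single fixed scalar codebook serve every coordinate.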

Current Validated Scope

Compression: 5.2x (target 4.5x)
Throughput: 61-84% of baseline depending on model/prompt length
Completion quality: matches baseline in tested suites
Retrieval quality (Qwen2.5-7B): NIAH gate met through 32K with retrieval-safe-v3
- Gate: baseline-vs-TurboQuant delta <= 2.0pp
- Observed on the 32K paired matrix (t6, i.e. --trials 6): 1.39pp

Important scope note:

Full high-trial retrieval closure is currently Qwen-focused.
Mistral/Gemma retrieval checks are currently smoke-level and tracked as follow-up.

Claim Policy (Read This First)

What this repo currently claims:

Reproducible paired baseline-vs-TurboQuant evaluation for Qwen.
Practical retrieval closure gate for the current release-check scope (through 16K).
Reproducible throughput/compression comparisons with provided scripts.

What this repo does not claim yet:

Full benchmark parity with all paper/blog experiments (LongBench/RULER/vector-search).
Generalized low-bit near-lossless parity across all models and contexts.

See docs/PAPER_CLAIMS_STATUS.md and docs/PAPER_COMPARISON.md.

Bits: This Repo vs Original Research

| Topic | Original paper/blog | This repo (current validated scope) |
| --- | --- | --- |
| Key path concept | `TurboQuant_prod` uses MSE + 1-bit QJL residual correction | MSE-first path with optional QJL scaffolding |
| Reported effective bit settings | Commonly highlighted around 2.5/3.5-bit variants in paper benchmarks | Completion default: key=6, value=6 |
| Retrieval-safe operating point | Paper reports near-lossless at lower effective bits in its setup | Qwen paired NIAH through 32K: key=8, value=6, buffer=16384 |
| Practical note | Theory + specialized benchmark setup | Engineering-focused, reproducible commands/artifacts in this repo |

For an explicit mapping of paper experiments to repo coverage, see docs/PAPER_COMPARISON.md.

Requirements (Step-by-Step)

1) Hardware / runtime

NVIDIA GPU strongly recommended for full benchmarks.
For RTX 50-series Blackwell (sm_120), use PyTorch nightly cu128 builds.
VRAM:
- ~12-16 GB: practical for the current Qwen scope (paired tests through 32K may still be slow and memory-heavy)
- higher VRAM helps for larger context and multi-model sweeps

2) System dependencies

Python 3.12
pip
CUDA-compatible NVIDIA driver

3) Python packages

This repo uses requirements.txt:

torch
transformers
triton
bitsandbytes
datasets
einops
accelerate

4) Hugging Face access (if needed)

Qwen and Mistral are generally accessible directly.
Llama-3.1 and Gemma 2 are gated: accept the model terms on Hugging Face and authenticate with an access token.
If you plan to run gated models:

pip install -U "huggingface_hub[cli]"
huggingface-cli login

Quickstart

Quick Verify (10-20 min)

source venv312/bin/activate
python scripts/reproduce_release.py --mode quick --output-dir results/repro_quick

Outputs:

results/repro_quick/repro_report.json
results/repro_quick/repro_report.md

Full Verify (Qwen release-check path)

source venv312/bin/activate
python scripts/reproduce_release.py --mode full --output-dir results/repro_full

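full mode is currently pinned to Qwen paired NIAH through 16K with a higher trial count. Both verify commands assume the venv312 environment from step 1) Environment below has already been created.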

1) Environment

git clone https://github.com/Taleef7/turboquant.git
cd turboquant
python -m venv venv312
source venv312/bin/activate
pip install -r requirements.txt

If you are on RTX 50-series/Blackwell, install PyTorch nightly cu128 first (then install requirements):

pip install --upgrade pip
pip install torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -r requirements.txt

2) Unit Tests (fast sanity)

pytest scripts/test_math.py scripts/test_kernels.py -v
pytest scripts/test_long_context_harness.py -q
pytest scripts/test_cache_config.py scripts/test_qjl_config.py -q

3) Baseline and TurboQuant demos

python scripts/run_baseline.py
python scripts/run_turboquant_v2.py

Reproducing Key Claims

A) Qwen paired NIAH retrieval gate (through 32K)

python scripts/test_long_context.py \
  --test niah \
  --mode paired \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-context 32768 \
  --key-bits 8 \
  --value-bits 6 \
  --buffer-size 16384 \
  --trials 6 \
  --output-prefix .tmp/niah_qwen_k8_v6_b16384_ctx32k_t6

This command directly compares baseline and TurboQuant on identical prompts and writes reproducible artifacts.
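
As a rough illustration of what "identical prompts" means in a paired NIAH run, the sketch below builds one deterministic set of needle-in-a-haystack cases that both the baseline and the compressed cache would be scored on. It is not the harness in scripts/test_long_context.py: the filler text, needle wording, and function names are invented for illustration, and context size is measured in filler sentences rather than tokens.

```python
import random


def build_niah_prompt(filler, needle, question, n_filler, depth):
    """Place `needle` at fractional `depth` (0.0 = start, 1.0 = end) of the haystack."""
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences) + "\n\n" + question


def make_paired_cases(context_sizes, depths, trials, seed=0):
    """Generate cases once; the same list is then fed to both cache variants."""
    rng = random.Random(seed)
    cases = []
    for n_filler in context_sizes:
        for depth in depths:
            for _ in range(trials):
                secret = str(rng.randrange(10 ** 6)).zfill(6)
                prompt = build_niah_prompt(
                    filler="The sky was clear and the market was busy that day.",
                    needle=f"The secret passcode is {secret}.",
                    question="What is the secret passcode?",
                    n_filler=n_filler,
                    depth=depth,
                )
                cases.append({"prompt": prompt, "answer": secret})
    return cases


cases = make_paired_cases(context_sizes=[50, 100], depths=[0.0, 0.5, 1.0], trials=2)
print(len(cases), "paired cases")
```

Fixing the seed means any accuracy difference between the two runs comes from the cache, not from prompt sampling.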

Expected outcome:

Delta (baseline-tq) at or under 2.0pp for this matrix
Most recent observed run: 1.39pp
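
The gate itself is simple arithmetic on paired hit/miss results. A minimal sketch, assuming each case yields a boolean "needle retrieved" flag per variant; the function names and the toy data are hypothetical and are not taken from the repo's artifacts:

```python
def paired_delta_pp(baseline_hits, tq_hits):
    """Baseline accuracy minus TurboQuant accuracy, in percentage points."""
    if not baseline_hits or len(baseline_hits) != len(tq_hits):
        raise ValueError("paired evaluation needs two non-empty lists of equal length")
    n = len(baseline_hits)
    return 100.0 * (sum(baseline_hits) - sum(tq_hits)) / n


def gate_passes(delta_pp, threshold_pp=2.0):
    """Gate from this README: delta must be at or under the threshold."""
    return delta_pp <= threshold_pp


# Toy data only: 100 paired cases, baseline retrieves 97, TurboQuant retrieves 96.
baseline = [True] * 97 + [False] * 3
turboquant = [True] * 96 + [False] * 4

delta = paired_delta_pp(baseline, turboquant)
print(f"delta = {delta:.2f}pp, gate passed: {gate_passes(delta)}")
```

Note that a negative delta (TurboQuant scoring higher than baseline on a given run) also passes under this definition.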

For current release-check scope, use the one-command full path above (16K) and the checklist in docs/REPLICATION_CHECKLIST.md.

B) Multi-model completion benchmark

python scripts/test_multimodel.py --model qwen2.5-7b
python scripts/test_multimodel.py --model mistral-7b
python scripts/test_multimodel.py --model llama3.1-8b
python scripts/test_multimodel.py --model gemma2-9b

C) Throughput baseline-vs-TurboQuant (Qwen)

python scripts/benchmark_throughput.py

Retrieval-Safe Configuration

configs/retrieval_profile.json currently defines:

key_bits=8
value_bits=6
buffer_size=16384
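
If you want to consume such a profile from your own scripts, a small validated loader could look like the sketch below. The three field names come from the list above, but the flat JSON layout, the `RetrievalProfile` class, and the allowed bit range are assumptions; the real configs/retrieval_profile.json may be structured differently.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalProfile:
    key_bits: int
    value_bits: int
    buffer_size: int

    def __post_init__(self):
        # Assumed sanity limits; adjust to whatever the quantizers actually support.
        for name in ("key_bits", "value_bits"):
            bits = getattr(self, name)
            if not 1 <= bits <= 8:
                raise ValueError(f"{name} must be in [1, 8], got {bits}")
        if self.buffer_size < 0:
            raise ValueError("buffer_size must be non-negative")


def load_profile(text):
    """Parse a flat JSON object into a validated profile."""
    raw = json.loads(text)
    return RetrievalProfile(
        key_bits=int(raw["key_bits"]),
        value_bits=int(raw["value_bits"]),
        buffer_size=int(raw["buffer_size"]),
    )


# Assumed flat layout mirroring the values listed in this README.
EXAMPLE_JSON = '{"key_bits": 8, "value_bits": 6, "buffer_size": 16384}'
profile = load_profile(EXAMPLE_JSON)
print(profile)
```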

Tradeoff:

Larger buffer_size improves retrieval robustness but increases uncompressed cache memory.
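
The memory side of this tradeoff can be estimated with a simple bit count. The sketch below is an idealised model, not a measurement: it assumes the most recent buffer_size tokens stay in fp16, everything older is stored at key_bits/value_bits per element, and it ignores per-vector norms, per-group scales and minima, and any QJL fields. It is therefore not expected to reproduce the figures reported elsewhere in this README.

```python
def effective_compression(seq_len, buffer_size, key_bits, value_bits, fp_bits=16):
    """Idealised baseline/compressed size ratio for one K element plus one V element per token."""
    recent = min(seq_len, buffer_size)   # tokens kept uncompressed
    old = seq_len - recent               # tokens stored quantized
    baseline_bits = seq_len * 2 * fp_bits
    compressed_bits = recent * 2 * fp_bits + old * (key_bits + value_bits)
    return baseline_bits / compressed_bits


for buffer_size in (0, 4096, 16384):
    ratio = effective_compression(32768, buffer_size, key_bits=8, value_bits=6)
    print(f"buffer_size={buffer_size:>5}: {ratio:.2f}x at 32K context")
```

Under these assumptions, a context that fits entirely inside the buffer is not compressed at all, and the ratio only approaches the no-buffer value as the context grows well beyond buffer_size.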

What Is Compared Today (for users)

Baseline vs TurboQuant completion behavior
Baseline vs TurboQuant paired NIAH retrieval (scripts/test_long_context.py --mode paired)
Throughput baseline vs TurboQuant (scripts/benchmark_throughput.py)

Not fully implemented yet for one-command parity with paper figures:

LongBench(-E) runner
RULER runner
full multi-method table generation (PQ/KIVI/etc.)

Use docs/PAPER_COMPARISON.md to see exactly what is covered vs deferred.

Project Layout

turboquant/
├── core/
│   ├── turboquant_cache_v2.py      # Main DynamicCache integration
│   └── turboquant_simple.py        # Quantizers (MSE path + optional QJL scaffolding)
├── codebooks/                      # Precomputed Lloyd-Max codebooks
├── configs/
│   └── retrieval_profile.json      # Retrieval-safe profile
├── scripts/
│   ├── test_long_context.py        # Paired NIAH harness
│   ├── test_long_context_harness.py# Harness unit tests
│   ├── test_multimodel.py          # Completion benchmark across models
│   ├── run_baseline.py
│   └── run_turboquant_v2.py
├── ISSUES.md                       # Local issue/status tracker
├── TESTING_RESULTS.md              # Detailed measured results
└── UPDATED_PLAN.md                 # Current execution/status plan

Known Limits / Deferred Work

Retrieval closure beyond 32K (64K/128K) is optional/deferred for now.
Full multi-model retrieval closure with higher trial counts is pending.
QJL path is scaffolded but not yet shown to improve paired NIAH in current integration.

References

Paper: TurboQuant (arXiv:2504.19874)
Current implementation notes: TESTING_RESULTS.md, UPDATED_PLAN.md, ISSUES.md
Paper comparison mapping: docs/PAPER_COMPARISON.md
Claims status matrix: docs/PAPER_CLAIMS_STATUS.md
Replication checklist: docs/REPLICATION_CHECKLIST.md
Contributor guide: CONTRIBUTING.md
Script index: scripts/README.md

Release Summary

This release stabilizes a reproducible, user-runnable TurboQuant workflow with:

clear baseline-vs-TurboQuant paired retrieval evaluation,
a documented retrieval-safe configuration (retrieval-safe-v3),
synchronized docs/status files describing validated scope and remaining optional work.

Practical outcome for current scope:

Qwen retrieval gate is closed through 32K with paired NIAH delta within target.

License

MIT - see LICENSE.
