LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.
Updated Apr 15, 2026 - C
Based on Google's TurboQuant (ICLR 2026), Quansloth brings state-of-the-art KV cache compression to local LLM inference. Quansloth is a fully private, air-gapped AI server that runs long-context models natively on consumer hardware.
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
Self-hosted, auto-clustering AI agent OS for low-cost consumer hardware: the computer you already have, an Orange Pi, a Raspberry Pi, or a Mac. Desktop shell, app store, agent deployment, distributed compute cluster. Memory by taOSmd.
TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cache 5-80x to run bigger models, longer context, more agents on your GPU.
Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).
Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement
TurboQuant: Native 3-Bit Quantization for Ollama - Achieve 25-28% better compression than Q4_0 while maintaining high-speed CPU inference. Experimentally integrated into Ollama with custom GGML kernels for LLM efficiency.
Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
HIP/ROCm fork optimized for AMD RDNA2 (gfx1030) with PrismML Q1_0_G128 1-bit quant support, RotorQuant, TurboQuant, EAGLE3 and P-EAGLE speculative decoding, and full Wave32 kernel optimizations.
Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
TurboQuant vector quantization for browser and edge runtimes
TurboQuant (ICLR 2026) ported to Apple Silicon — KV cache compression with MLX Metal kernels + PyTorch CPU
SIMD-accelerated 4/8-bit vector quantization for approximate nearest neighbor search, based on TurboQuant (ICLR 2026). Standalone C++17 library with Python bindings
A TurboQuant implementation for Llama.cpp on AMD GPUs with the Vulkan runtime
AMD ROCm (gfx1030) inference fork with RotorQuant/TurboQuant KV compression, PHANTOM-X zero-copy draft speculation, EAGLE3 speculative decoding, 12 RDNA2 crash fixes, and PrismML Bonsai Q1_0_G128 1-bit GGUF support.
Native Windows build of vLLM 0.19.0 — no WSL, no Docker. Pre-built wheels + 33-file Windows patch + Multi-TurboQuant KV cache compression (6 methods, 2x cache capacity). PyTorch 2.10 + CUDA 12.6 + Triton + Flash-Attention 2.