- Self-contained inference engine — loads Qwen3.5-0.8B, generates text at 14 tok/s on CPU
- 17x faster than PyTorch CPU, 1.4x faster than PyTorch on Apple GPU
- Q8 weight quantization (`-q` flag) — 4x memory reduction (2.1 GB → 533 MB); see the sketch after this list
- Streaming BF16 — embed/lm_head kept as mmap'd BF16, saves ~1 GB
- Multi-threaded matmul — 4-thread pthread, 1.56x speedup
- DeltaNet + Self-Attention — full Qwen3.5 hybrid architecture in C
- HuggingFace BPE tokenizer — 248K vocab, encode/decode
- Quantized KV-cache inference — Q4 keys, integer Q4×Q8 attention
- Integer-domain attention: 2.9-4.8x faster than FP32 on Apple Silicon (ARM NEON vdotq_s32)
- Real model validated: Qwen3.5-0.8B KV cache, cosine 0.994 (A+)
- 8 quantization types, including mixed-precision outlier handling and RHT (randomized Hadamard transform) pre-rotation
- K/V asymmetric: independent key/value bit allocation (K4V2 = 9.8x compression)
- Community validated: r/LocalLLaMA findings integrated
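
The Q8 weight path flagged above follows the standard symmetric per-row scheme; a minimal sketch of the idea (struct and function names are illustrative, not the repo's API):

```c
#include <math.h>
#include <stdint.h>

/* Illustrative only: symmetric per-row Q8 weight quantization.
 * Each row keeps one FP32 scale; weights become int8 in [-127, 127]. */
typedef struct {
    float   scale;   /* max|w| / 127 for this row */
    int8_t *q;       /* quantized weights, length = cols */
} q8_row_t;

static void q8_quantize_row(const float *w, int cols, q8_row_t *out) {
    float amax = 0.0f;
    for (int i = 0; i < cols; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    const float inv = (amax > 0.0f) ? 127.0f / amax : 0.0f;
    for (int i = 0; i < cols; i++) {
        int v = (int)lroundf(w[i] * inv);
        out->q[i] = (int8_t)(v > 127 ? 127 : (v < -127 ? -127 : v));
    }
    /* dequantize on use: w[i] ≈ out->q[i] * out->scale */
}
```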
The single biggest performance breakthrough: instead of dequantizing Q4 keys to FP32, quantize the query to Q8 and compute integer dot products directly.
Before (v0.6): Q4 key → dequantize → FP32 dot = 0.49x vs FP32 (SLOWER)
After (v0.7): Q4 key × Q8 query → integer dot = 2.9-4.8x vs FP32 (FASTER)
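
A minimal sketch of the idea, assuming unpacked layouts (key nibbles already widened to int8 in [-8, 7], the query quantized to Q8 as int8, and n a multiple of 16); names are illustrative, and the real block formats carry per-block scales:

```c
#include <arm_neon.h>  /* build with -march=armv8.2-a+dotprod */
#include <stdint.h>

/* Illustrative only: one Q4xQ8 dot product kept in the integer domain.
 * Instead of dequantizing every key element to FP32, accumulate int8
 * products with vdotq_s32 and apply both quantization scales once at
 * the end. Assumes n is a multiple of 16 (head_dim 128/256 qualifies). */
static float dot_q4_q8(const int8_t *k4, const int8_t *q8, int n,
                       float k_scale, float q_scale) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t a = vld1q_s8(k4 + i);
        int8x16_t b = vld1q_s8(q8 + i);
        acc = vdotq_s32(acc, a, b);  /* 16 int8 multiply-accumulates */
    }
    return (float)vaddvq_s32(acc) * k_scale * q_scale;
}
```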
Fair NEON-vs-NEON benchmark (Apple M-series, median of 7 runs):
- dim=128, seq=2048: FP32 22.8μs → Int Q4×Q8 7.8μs (2.9x)
- dim=256, seq=2048: FP32 57.7μs → Int Q4×Q8 12.5μs (4.6x)
- Larger head_dim benefits more (Q4 data fits in L1 cache)
- 7 quantization types: PolarQuant (3/4b), QJL (1b), quant.cpp (3/4b), Uniform (2/4b)
- Direct attention kernels: QJL Hamming distance, PolarQuant cos/sin LUT (no dequantization needed)
- Self-contained block formats with ONNX-compliant LSB-first bit packing (packing sketch after this list)
- O(1) type traits dispatch table (llama.cpp pattern)
- Thread-safe API with pthread mutex (TSan verified)
- Cross-platform math constants (TQ_PI/TQ_PI_2, no M_PI dependency)
- Paged KV cache with block-table mapping (vLLM pattern)
- Progressive compression: 3-tier automatic degradation by age, O(1) append
- Copy-on-Write for beam search (ref_count based)
- Value cache quantization and retrieval
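
For the LSB-first packing flagged above, a minimal sketch (assuming two 4-bit codes per byte with the first element in the low nibble, which is how ONNX specifies int4 packing; function names are illustrative):

```c
#include <stdint.h>

/* Illustrative only: LSB-first 4-bit packing, two codes per byte.
 * Element 2i lands in the low nibble, element 2i+1 in the high nibble. */
static void pack_q4_lsb_first(const uint8_t *codes, int n, uint8_t *out) {
    for (int i = 0; i < n; i += 2) {
        uint8_t lo = codes[i] & 0x0F;
        uint8_t hi = (i + 1 < n) ? (uint8_t)(codes[i + 1] & 0x0F) : 0;
        out[i / 2] = (uint8_t)(lo | (hi << 4));
    }
}

static uint8_t unpack_q4_lsb_first(const uint8_t *packed, int idx) {
    uint8_t b = packed[idx / 2];
    return (idx % 2 == 0) ? (uint8_t)(b & 0x0F) : (uint8_t)(b >> 4);
}
```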
- CPU Generic (reference C11, zero external dependencies) — backend selection sketched after this list
- ARM NEON optimized (5.74x speedup over generic)
- x86 AVX2 stubs ready for implementation
- CUDA kernels: 7 files (polar, qjl, turbo, fused_cache, value, common, dispatch)
- Metal compute shaders: 7 files (polar, qjl, turbo, fused_cache, value, common, dispatch)
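
One common way to layer the CPU backends above (a sketch of the pattern only, not the repo's actual build wiring; the `dot_i8_*` names are illustrative):

```c
#include <stdint.h>

/* Illustrative only: the C11 reference kernel is always available;
 * a SIMD specialization takes over when the target supports it. */
static int32_t dot_i8_generic(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) acc += (int32_t)a[i] * b[i];
    return acc;
}

#if defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
/* NEON path; assumes n is a multiple of 16. */
static int32_t dot_i8_neon(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16)
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    return vaddvq_s32(acc);
}
#define dot_i8 dot_i8_neon      /* NEON builds take the fast path */
#else
#define dot_i8 dot_i8_generic   /* generic elsewhere; an AVX2 stub would slot in here */
#endif
```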
- A/B test: uniform_4b achieves cosine 0.995 vs FP16 — A+ grade, virtually lossless
- Real model validation: cosine 0.991 on Qwen3.5-0.5B KV cache patterns (4 layers, 14 heads)
- Per-layer analysis: quality consistent across depth (cosine >0.98 for uniform_4b)
- Roundtrip MSE: 0.0014 (synthetic), 0.0025 (real model data)
- Quantize throughput: 1.4M elements/ms
- Attention throughput: 137K queries/sec
- Compression ratio: 7.53x (uniform_4b) — see the accounting note after this list
- SIMD speedup: 4.0x (NEON vs generic)
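
The 7.53x figure is consistent with one plausible block accounting (an assumed layout, not read from the repo): relative to FP32 input, with 64-element blocks, 4-bit codes, and one FP16 scale per block, a block costs 64×4 + 16 = 272 bits against 64×32 = 2048 bits raw, i.e. 2048 / 272 ≈ 7.53x. Measured against an FP16 baseline instead, the same format gives 1024 / 272 ≈ 3.76x, which matches most rows of the memory table below.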
- 13 C++ test suites (Google Test): polar, qjl, turbo, uniform, value, paged_cache, progressive, simd_neon, simd_avx2, threading, edge_cases, attention_all_types, llamacpp_integration
- 22 Python tests (unittest): bindings, roundtrip, attention, types
- Total: 35 tests, 100% pass rate
- Sanitizers: ASan + UBSan + TSan clean
- llama.cpp: GGML type registration (7 types, base offset 256), CLI parser with 21 aliases, from_float/to_float/vec_dot wrappers, 10 integration tests
- Python: ctypes bindings with NumPy support, pip installable (`pip install -e .`), quant.cpp class with quantize_keys/dequantize_keys/attention methods
- vLLM: integration scaffold with README guide
- Examples: minimal.c (10 lines), standalone.c, ab_test.c, demo_real_model.c, benchmark_types.cpp, python_quickstart.py, llamacpp_integration.cpp
- Integer overflow protection in size calculations (checked-size sketch after this list)
- NULL pointer and buffer size validation on all public APIs
- Edge case defense: seq_len=0, head_dim<2, odd dimensions
- TQ_ERR_BUFFER_TOO_SMALL error code
- tq_type_from_name() / tq_type_count() convenience functions
- Bytes-per-element (BPE) values computed from actual struct sizes
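
A minimal sketch of the overflow guard flagged above (assuming the standard check-before-multiply idiom; the function names are hypothetical, not the repo's API):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: returns 0 if a*b would wrap around SIZE_MAX,
 * otherwise the product. A 0 result (overflow, or a seq_len=0 edge
 * case) means the caller must reject the request. */
static size_t checked_mul(size_t a, size_t b) {
    if (a != 0 && b > SIZE_MAX / a) return 0;
    return a * b;
}

/* Hypothetical buffer sizing: validate term by term so a hostile
 * shape can never alias a too-small allocation. */
static size_t cache_bytes(size_t seq_len, size_t n_heads,
                          size_t head_dim, size_t bytes_per_elem) {
    size_t s = checked_mul(seq_len, n_heads);
    s = checked_mul(s, head_dim);
    return checked_mul(s, bytes_per_elem);
}
```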
- 5-dimension scoring harness: structure/correctness/quality/performance/integration
- Hierarchical Harness methodology (Karpathy AutoResearch + ClawTeam multi-agent)
- Agent definitions (.claude/agents/): architect, core-dev, perf-dev, qa
- Skill definitions (.claude/skills/): orchestrate, develop, score, qa
- Slash commands (.claude/commands/): /score, /develop, /harness, /spawn-team, /merge-gate
- PRD documents: v0.1 through v0.4
- WBS documents: v0.1 through v0.4
- refs/ absorption audit with checklist
| Model | Context | FP16 Cache | quant.cpp Cache | Saved |
|---|---|---|---|---|
| Llama-3.2-3B | 64K | 7.00 GB | 0.93 GB | 87% |
| Qwen3.5-0.5B | 128K | 10.50 GB | 2.79 GB | 73% |
| Phi-3-mini | 16K | 6.00 GB | 1.59 GB | 73% |
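
As a sanity check on the FP16 column (assuming the table was computed from Llama-3.2-3B's published shape: 28 layers, 8 KV heads, head_dim 128): 2 (K+V) × 28 layers × 8 KV heads × 128 dims × 65,536 tokens × 2 bytes/elem = 7.00 GB, matching the first row. The Saved column is then 1 − quant.cpp/FP16 (for that row, 1 − 0.93/7.00 ≈ 87%).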
- quant.cpp (ICLR 2026) — arXiv:2504.19874
- QJL (AAAI 2025) — arXiv:2406.03482
- PolarQuant (AISTATS 2026) — arXiv:2502.02617
- Harness plugin (revfactory/harness) — agent team methodology