Changelog

[0.1.0] — 2026-03-29

Highlights

Self-contained inference engine — loads Qwen3.5-0.8B, generates text at 14 tok/s on CPU
17x faster than PyTorch CPU, 1.4x faster than PyTorch on Apple GPU
Q8 weight quantization — 4x memory reduction (2.1 GB → 533 MB), -q flag
Streaming BF16 — embed/lm_head kept as mmap'd BF16, saves ~1 GB
Multi-threaded matmul — 4-thread pthread, 1.56x speedup
DeltaNet + Self-Attention — full Qwen3.5 hybrid architecture in C
HuggingFace BPE tokenizer — 248K vocab, encode/decode
KV cache quantized in inference — Q4 keys, integer Q4×Q8 attention

v0.8 Inference Engine

Integer-domain attention: 2.9-4.8x faster than FP32 on Apple Silicon (ARM NEON vdotq_s32)
Real model validated: Qwen3.5-0.8B KV cache, cosine 0.994 (A+)
8 quantization types including mixed precision outlier and RHT pre-rotation
K/V asymmetric: independent key/value bit allocation (K4V2 = 9.8x compression)
Community validated: r/LocalLLaMA findings integrated

Integer-Domain Attention (v0.7)

The single biggest performance breakthrough: instead of dequantizing Q4 keys to FP32, quantize the query to Q8 and compute integer dot products directly.

Before (v0.6): Q4 key → dequantize → FP32 dot = 0.49x vs FP32 (SLOWER)
After  (v0.7): Q4 key × Q8 query → integer dot = 2.9-4.8x vs FP32 (FASTER)

Fair NEON-vs-NEON benchmark (Apple M-series, median of 7 runs):

dim=128, seq=2048: FP32 22.8μs → Int Q4×Q8 7.8μs (2.9x)
dim=256, seq=2048: FP32 57.7μs → Int Q4×Q8 12.5μs (4.6x)
Larger head_dim benefits more (Q4 data fits in L1 cache)

Core Library

7 quantization types: PolarQuant (3/4b), QJL (1b), quant.cpp (3/4b), Uniform (2/4b)
Direct attention kernels: QJL Hamming distance, PolarQuant cos/sin LUT (no dequantization needed)
Self-contained block formats with ONNX-compliant LSB-first bit packing
O(1) type traits dispatch table (llama.cpp pattern)
Thread-safe API with pthread mutex (TSan verified)
Cross-platform math constants (TQ_PI/TQ_PI_2, no M_PI dependency)

Cache Management

Paged KV cache with block-table mapping (vLLM pattern)
Progressive compression: 3-tier automatic degradation by age, O(1) append
Copy-on-Write for beam search (ref_count based)
Value cache quantization and retrieval

Backends

CPU Generic (reference C11, zero external dependencies)
ARM NEON optimized (5.74x speedup over generic)
x86 AVX2 stubs ready for implementation
CUDA kernels: 7 files (polar, qjl, turbo, fused_cache, value, common, dispatch)
Metal compute shaders: 7 files (polar, qjl, turbo, fused_cache, value, common, dispatch)

Validation

A/B test: uniform_4b achieves cosine 0.995 vs FP16 — A+ grade, virtually lossless
Real model validation: cosine 0.991 on Qwen3.5-0.5B KV cache patterns (4 layers, 14 heads)
Per-layer analysis: quality consistent across depth (cosine >0.98 for uniform_4b)
Roundtrip MSE: 0.0014 (synthetic), 0.0025 (real model data)

Performance (Apple M-series ARM)

Quantize throughput: 1.4M elements/ms
Attention throughput: 137K queries/sec
Compression ratio: 7.53x (uniform_4b)
SIMD speedup: 4.0x (NEON vs generic)

Testing

13 C++ test suites (Google Test): polar, qjl, turbo, uniform, value, paged_cache, progressive, simd_neon, simd_avx2, threading, edge_cases, attention_all_types, llamacpp_integration
22 Python tests (unittest): bindings, roundtrip, attention, types
Total: 35 tests, 100% pass rate
Sanitizers: ASan + UBSan + TSan clean

Integration

llama.cpp: GGML type registration (7 types, base offset 256), CLI parser with 21 aliases, from_float/to_float/vec_dot wrappers, 10 integration tests
Python: ctypes bindings with NumPy support, pip installable (pip install -e .), quant.cpp class with quantize_keys/dequantize_keys/attention methods
vLLM: integration scaffold with README guide
Examples: minimal.c (10 lines), standalone.c, ab_test.c, demo_real_model.c, benchmark_types.cpp, python_quickstart.py, llamacpp_integration.cpp

Production Readiness (v0.4)

Integer overflow protection in size calculations
NULL pointer and buffer size validation on all public APIs
Edge case defense: seq_len=0, head_dim<2, odd dimensions
TQ_ERR_BUFFER_TOO_SMALL error code
tq_type_from_name() / tq_type_count() convenience functions
BPE values computed from actual struct sizes

Developer Experience

5-dimension scoring harness: structure/correctness/quality/performance/integration
Hierarchical Harness methodology (Karpathy AutoResearch + ClawTeam multi-agent)
Agent definitions (.claude/agents/): architect, core-dev, perf-dev, qa
Skill definitions (.claude/skills/): orchestrate, develop, score, qa
Slash commands (.claude/commands/): /score, /develop, /harness, /spawn-team, /merge-gate
PRD documents: v0.1 through v0.4
WBS documents: v0.1 through v0.4
refs/ absorption audit with checklist

Memory Impact

Model	Context	FP16 Cache	quant.cpp	Saved
Llama-3.2-3B	64K	7.00 GB	0.93 GB	87%
Qwen3.5-0.5B	128K	10.50 GB	2.79 GB	73%
Phi-3-mini	16K	6.00 GB	1.59 GB	73%

References

quant.cpp (ICLR 2026) — arXiv:2504.19874
QJL (AAAI 2025) — arXiv:2406.03482
PolarQuant (AISTATS 2026) — arXiv:2502.02617
Harness plugin (revfactory/harness) — agent team methodology

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changelog

[0.1.0] — 2026-03-29

Highlights

v0.8 Inference Engine

Integer-Domain Attention (v0.7)

Core Library

Cache Management

Backends

Validation

Performance (Apple M-series ARM)

Testing

Integration

Production Readiness (v0.4)

Developer Experience

Memory Impact

References

FilesExpand file tree

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

[0.1.0] — 2026-03-29

Highlights

v0.8 Inference Engine

Integer-Domain Attention (v0.7)

Core Library

Cache Management

Backends

Validation

Performance (Apple M-series ARM)

Testing

Integration

Production Readiness (v0.4)

Developer Experience

Memory Impact

References