Config-driven asymmetric quantization for DeepSeek-V4-Flash.
forgequant turns a small JSON recipe — per-tensor-family quant types, an
importance matrix (imatrix), and optionally a per-layer expert boost — into a
quantized GGUF, reproducibly, with a manifest. It wraps ds4's
deepseek4-quantize (plus ds4 --imatrix-dataset, ds4-server --imatrix-out and
the GGUF splicer), so you stop hand-assembling long quantizer command lines.
The point: a model's 2-bit budget can be spent asymmetrically, at three depths:
- family: keep attention/shared/output near-lossless (Q8), push the routed experts to 2 bits;
- expert: an imatrix re-allocates those 2 bits inside every tensor toward the experts your workload actually activates (ds4 records per-(layer, expert) activation statistics);
- layer: boost upcasts the routed experts of the layers your workload lives in (e.g. Q4_K on the 6 hottest), via --tensor-type overrides; alternatively, splice copies them from a donor GGUF in minutes, without requantizing.
forgequant makes that recipe a file you can version, diff, and re-run — and gives you the tools to see the activation paths before you spend the bits.
Specific to DeepSeek-V4-Flash and ds4's quantizer — not a general GGUF tool.
Python 3.8+ (standard library only; numpy used opportunistically if present). You
need a built ds4 checkout (provides ds4, ds4-server and
gguf-tools/deepseek4-quantize), the FP model source, and a template GGUF.
git clone --recursive https://github.com/andreaborio/forgequant.git # --recursive pulls benchy
# already cloned? grab the benchy submodule:
git submodule update --init
export DS4_DIR=~/BEEP/ds4 # your ds4 checkout (default ~/ds4)
export MODELS_DIR=~/BEEP/ds4-models # models/imatrices ({models} in recipes)
benchy (the benchmark source) is vendored as a git submodule, so a recursive
clone is self-contained. git submodule update --remote benchy bumps it to
benchy's latest registry.
python3 forgequant.py list # available recipes
python3 forgequant.py show coder-q4boost # resolved recipe + EXACT commands (nothing runs)
python3 forgequant.py verify coder-q4boost # preflight: paths, imatrix, disk space
python3 forgequant.py build coder-q4boost # full pipeline: imatrix (if missing) -> quantize
<recipe> is a preset name (recipes/<name>.json) or a path to your own .json.
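As a rough illustration of that lookup rule, the sketch below resolves a recipe argument to a file. The name resolve_recipe and the recipes_dir parameter are hypothetical, not forgequant's actual API. The sketch only assumes that an argument ending in .json is a path and that anything else is a preset name.

```python
from pathlib import Path


def resolve_recipe(arg, recipes_dir="recipes"):
    """Return the path of the recipe named by `arg`.

    Assumption: an argument ending in .json is a path to your own recipe;
    anything else is looked up as <recipes_dir>/<name>.json.
    """
    if arg.endswith(".json"):
        path = Path(arg).expanduser()
    else:
        path = Path(recipes_dir) / (arg + ".json")
    if not path.is_file():
        raise FileNotFoundError("no such recipe: %s" % path)
    return path
```

Failing early on a missing file keeps a typo in a preset name from reaching the quantizer.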
The imatrix is the activation-path record that steers the bits. Pick your source:
# 1. from REAL benchmarks (the questions a domain expert faces) — fetched from benchy
python3 forgequant.py build coder-q4boost # the `bench` block builds the corpus first
# 2. from a corpus you already have (rendered prompt dataset)
python3 forgequant.py imatrix medical-iq2
# 3. from ANY raw prompt list — render it first, no ds4 python tooling needed
python3 forgequant.py render my_prompts.txt -o coder_corpus.txt # .txt or .jsonl
python3 forgequant.py imatrix coder-q4boost
# 4. from LIVE inference — serve the model, use it for real, Ctrl-C when done
python3 forgequant.py capture coder-q4boost --port 8000
capture wraps ds4-server --imatrix-out: it records only aggregate per-expert
activation statistics from your real traffic — no prompt text is ever stored
(see ds4's ONEDGE_IMATRIX.md), and snapshots are written periodically. Ctrl-C is a
graceful stop: forgequant waits for ds4-server to flush its final snapshot.
Calibrate a domain imatrix on the questions a domain expert actually faces. Benchmarks come from benchy (github.com/andreaborio/benchy), vendored as a git submodule so everyone who clones forgequant gets the same source — a registry of real, non-saturated evals (MMLU-Pro, SuperGPQA, HumanEval, MBPP, MedXpertQA, MedQA, …) fetched live from the HuggingFace datasets-server and normalized.
git submodule update --init # first time: pull benchy in
python3 forgequant.py bench list # the registry (current vs saturated)
python3 forgequant.py bench bundles # domain bundles: code / medical / reasoning / …
python3 forgequant.py bench corpus code -o bench/corpora/code.txt --answers --mix reasoning
Or declare it in a recipe and let build do everything:
"bench": {"keys": ["humaneval","mbpp","mmlu_cs"], "answers": true, "mix": "reasoning", "cap": 400}
--answers adds the gold answer as an assistant turn (so the imatrix sees the
activation paths of answering, not just reading); --mix DOMAIN interleaves a
general set so a domain imatrix doesn't over-specialize. Every corpus build records
its provenance (benchmark keys, row SHAs, upstream dataset commit, options) under
bench/runs/ — tracked in the repo, so a calibration is always traceable to the exact
benchmark snapshot. forgequant never redistributes benchmark data: rows are fetched
from HF on demand.
Pinned & verifiable. forgequant talks to benchy through its stable api contract
(benchy/api.py, API_VERSION), never its internals — so a benchy refactor can't
break forgequant. benchy's benchmarks.lock.json pins each benchmark to an exact
upstream dataset commit + content hash; fetches are verified against it, so upstream
drift is detected, not silently absorbed. Inspect with python3 benchy/api.py status;
accept an upstream change with python3 benchy/api.py relock <key>. If you run against
an older benchy without api.py, forge_bench falls back to the legacy fetcher
(unpinned) automatically.
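The pin-and-verify idea can be sketched as follows. This is not benchy's code: the lock field names commit and sha256, the function names, and the canonical-JSON hashing scheme are all assumptions made for illustration, and benchy's real lock format may differ.

```python
import hashlib
import json


def content_hash(rows):
    """SHA-256 over a canonical JSON encoding of the rows (assumed scheme)."""
    blob = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def check_against_lock(key, commit, rows, lock):
    """Return a list of drift messages; an empty list means the fetch matches.

    `lock` maps a benchmark key to {"commit": ..., "sha256": ...}
    (hypothetical field names).
    """
    entry = lock.get(key)
    if entry is None:
        return ["%s: not pinned in lock" % key]
    problems = []
    if entry["commit"] != commit:
        problems.append("%s: upstream commit changed (%s -> %s)"
                        % (key, entry["commit"], commit))
    if entry["sha256"] != content_hash(rows):
        problems.append("%s: content hash mismatch" % key)
    return problems
```

Returning messages instead of raising lets a caller report every drifted benchmark in one pass before deciding whether to relock.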
python3 forgequant.py paths coder-q4boost # per-layer/per-expert heatmap
python3 forgequant.py paths a.dat --diff b.dat # what does CODE light up that MEDICAL doesn't?
python3 forgequant.py suggest coder-q4boost --top 6 --type q4_k # boost proposal + size cost
paths parses the .dat directly (the format packs one importance vector per
expert per routed tensor) and shows where the workload concentrates. suggest
turns that into a ready-to-paste boost block with an estimated size delta.
Values are count-normalized activation energy: how hard an expert works when
routed; never-routed experts show as cold (zero).
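To make the ranking step concrete, here is a minimal sketch of turning a per-layer, per-expert energy grid into a boost block. It is not forgequant's implementation: hottest_layers and boost_block are hypothetical names, the input is a plain nested list rather than a parsed .dat, and "hottest" is assumed to mean the largest total energy summed over a layer's experts.

```python
def hottest_layers(energy, top=6):
    """energy[layer][expert] -> sorted indices of the `top` hottest layers.

    Assumption: a layer's heat is the sum of its experts' energies.
    Ties are broken toward the lower layer index.
    """
    totals = [(sum(row), i) for i, row in enumerate(energy)]
    ranked = sorted(totals, key=lambda t: (-t[0], t[1]))
    return sorted(i for _, i in ranked[:top])


def boost_block(energy, top=6, qtype="q4_k"):
    """Build a recipe fragment using the explicit-list form of "layers"."""
    return {"boost": {"layers": hottest_layers(energy, top), "type": qtype}}
```

For example, with four layers whose totals are 0, 3, 10 and 0.5, boost_block(energy, top=2) selects layers 1 and 2.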
A single-file web dashboard drives forgequant from the browser — template gallery, recipe builder (families + boost + imatrix), build/quantize/imatrix/capture/splice actions, live progress (per-tensor, ETA), an interactive brain map of any imatrix (43×256 heatmap, hot layers, diff between two imatrices, one-click boost suggestion), past-runs browser, and the table of forged models.
python3 forge_ui.py # -> http://localhost:8060
Stdlib only; same DS4_DIR / MODELS_DIR config.
{
"name": "coder-q4boost",
"description": "...",
"hf": "{models}/DeepSeek-V4-Flash-FP", // FP safetensors source
"template": "{models}/<base>.gguf", // metadata/order/shapes; non-listed families copied from here
"imatrix": "{models}/coder.dat", // legacy .dat imatrix (applied per expert)
"corpus": "{models}/coder_corpus.txt", // optional: build the imatrix from this if missing
"imatrix_max_tokens": 120000,
"quant": { // family -> quant type (only what you change)
"routed_w1": "iq2_xxs", // gate experts
"routed_w3": "iq2_xxs", // up experts
"routed_w2": "q2_k" // down experts
},
"boost": { // per-layer expert upcast (optional)
"layers": "auto:6", // N hottest layers from the imatrix — or "37-42", or [37,40]
"type": "q4_k",
"families": ["w1","w2","w3"] // optional subset
},
"tensor_types": {"blk.0.": "q8_0"}, // raw --tensor-type prefix overrides (optional)
"reuse": "{models}/DeepSeek-V4-Flash-coder-iq2.gguf", // copy unchanged tensors from a prior build (optional)
"splice": { // fast layer boost without requantizing (optional)
"donor": "{models}/<q4-variant>.gguf",
"layers": "auto:6"
},
"threads": 16
}
Families: routed_w1/w2/w3 (gate/down/up experts), experts (all three),
attention, attn_proj, shared, embedding, output, dense. Anything you
omit is copied verbatim from template.
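The three accepted forms of the boost "layers" value ("auto:N", a range such as "37-42", or a list such as [37,40]) could be normalized along these lines. parse_layers is a hypothetical helper, not forgequant's parser. It assumes ranges are inclusive, which matches the six layers of last6-q4boost, and it takes the hottest-first layer ranking as an argument instead of reading an imatrix.

```python
def parse_layers(spec, ranked_hot=None):
    """Expand a boost "layers" value into a sorted list of layer indices.

    spec       -- [37, 40], "37-42" (assumed inclusive) or "auto:N"
    ranked_hot -- layer indices ordered hottest first; needed for "auto:N"
    """
    if isinstance(spec, list):
        return sorted(int(x) for x in spec)
    if spec.startswith("auto:"):
        n = int(spec.split(":", 1)[1])
        if ranked_hot is None:
            raise ValueError("auto:N needs a hottest-first layer ranking")
        return sorted(ranked_hot[:n])
    lo, hi = (int(x) for x in spec.split("-", 1))
    return list(range(lo, hi + 1))
```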
Producible quant types: deepseek4-quantize can only generate iq2_xxs,
q2_k, q4_k, q8_0 (plus f16/bf16/f32 passthrough) — these are the
ds4q_can_quantize() types in ds4's quants.c. Other names (q3_k, iq3_xxs,
iq2_s, q5_k, q6_k, …) parse but the quantizer rejects them with "unsupported
quant target type", so forgequant validates recipes up front.
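An up-front check of this kind might look like the sketch below. The set of producible types is taken from the paragraph above; the function name and the choice of which recipe fields to scan (quant, boost.type, tensor_types) are assumptions, not forgequant's actual validator.

```python
# Types deepseek4-quantize can produce, per the paragraph above.
PRODUCIBLE = {"iq2_xxs", "q2_k", "q4_k", "q8_0", "f16", "bf16", "f32"}


def unsupported_types(recipe):
    """Return sorted (location, type) pairs the quantizer would reject."""
    found = []
    for family, qtype in recipe.get("quant", {}).items():
        found.append(("quant." + family, qtype))
    if "type" in recipe.get("boost", {}):
        found.append(("boost.type", recipe["boost"]["type"]))
    for prefix, qtype in recipe.get("tensor_types", {}).items():
        found.append(("tensor_types." + prefix, qtype))
    return sorted((where, t) for where, t in found
                  if str(t).lower() not in PRODUCIBLE)
```

An empty result means every type named in the recipe is producible; anything else can be reported before the long quantize run starts.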
{models} expands to $MODELS_DIR, {name} to the recipe name, ~ to your home.
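A minimal sketch of that expansion, assuming plain string substitution followed by home-directory expansion. The function name expand and the models_dir override are hypothetical, and falling back to an empty string when MODELS_DIR is unset is an assumption, not forgequant's documented default.

```python
import os


def expand(value, name, models_dir=None):
    """Expand {models}, {name} and a leading ~ in a recipe path."""
    # Assumption: use $MODELS_DIR unless the caller overrides it.
    models = models_dir or os.environ.get("MODELS_DIR", "")
    out = value.replace("{models}", models).replace("{name}", name)
    # Expand ~ last, so a MODELS_DIR such as ~/BEEP/ds4-models also works.
    return os.path.expanduser(out)
```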
| Granularity | Mechanism | Cost to test |
|---|---|---|
| family (all experts) | quant → --routed-w1/w2/w3 | full requantize |
| expert (within a tensor) | imatrix: per-expert bit steering, automatic | imatrix run |
| layer (chosen experts ×3 tensors) | boost → --tensor-type overrides | requantize changed layers only with reuse |
| layer, instantly | splice: copy from donor GGUF | minutes |
Per-expert types inside one fused tensor aren't possible (GGUF stores one type
per tensor; verified in deepseek4-quantize.c) — the imatrix's per-expert bit
steering plus layer boost is the practical equivalent.
reuse (incremental re-quantize). A boost only changes a few layers, but a
plain requantize regenerates all 43 from FP. Point reuse at a prior build with the
same imatrix (e.g. a coder-iq2 for a coder-q4boost) and the quantizer copies
the byte-identical unchanged tensors, regenerating only the boosted ones — ~85% less
work. Safe by construction: it's gated on a quantize.reuse_key (hash of the
safetensors index + imatrix) plus a per-tensor type/shape match, so a mismatched or
missing prior falls back to a full quantize. Needs ds4's
deepseek4-quantize --reuse (this fork).
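The gating logic described above can be sketched as follows. This illustrates the idea only, not ds4's implementation: reuse_key and plan_reuse are hypothetical names, the way the key combines the two hashes is an assumption, and the prior build's metadata is modeled as a plain dict.

```python
import hashlib


def reuse_key(index_bytes, imatrix_bytes):
    """Hash of the safetensors index + imatrix (assumed combination scheme)."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(index_bytes).digest())
    h.update(hashlib.sha256(imatrix_bytes).digest())
    return h.hexdigest()


def plan_reuse(prior, wanted, key):
    """Split tensors into (copy, regenerate) name lists.

    prior  -- {"reuse_key": str, "tensors": {name: (type, shape)}} or None
    wanted -- {name: (type, shape)} for the build being requested
    """
    if prior is None or prior.get("reuse_key") != key:
        return [], sorted(wanted)  # mismatched or missing prior: full quantize
    copy, regenerate = [], []
    for name in sorted(wanted):
        if prior["tensors"].get(name) == wanted[name]:
            copy.append(name)
        else:
            regenerate.append(name)
    return copy, regenerate
```

With a matching key, only tensors whose type or shape changed (the boosted layers) land in the regenerate list; any key mismatch sends everything there.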
| Recipe | Idea | For |
|---|---|---|
| medical-iq2 | IQ2_XXS · Q2_K + medical imatrix | the proven BeepMed recipe |
| coder-iq2 | same budget, code-calibrated imatrix | coding workloads |
| coder-q4boost | coder-iq2 + Q4_K on the 6 code-hottest layers | "keep my coding expert sharp" |
| medical-q4boost | medical-iq2 + Q4_K on the 6 med-hottest layers | BeepMed, higher fidelity |
| last6-q4boost | static Q4_K on layers 37-42 | ds4's proven mixed experiment |
| splice-fast | copy hot layers from a Q4 donor | fastest A/B loop |
| balanced | Q4_K gate/up · Q2_K down | bigger, higher fidelity |
| aggressive | IQ2_XXS everywhere | smallest, most lossy |
Each forgequant output is a drop-in model. Serve it and measure with benchy:
ds4-server -m ~/BEEP/ds4-models/DeepSeek-V4-Flash-coder-q4boost.gguf --ssd-streaming --port 8000 &
python3 ../benchy/eval_mcq.py data/humaneval.jsonl 60 think coder-q4boost
For quick quality signals without a benchmark run, ds4's
gguf-tools/quality-testing/ scores GGUF variants by NLL against official
DeepSeek continuations.
Quantization is deterministic: the same recipe + the same imatrix produce the same
GGUF. Every quantize/splice writes <out>.manifest.json — the resolved recipe,
the exact command, the ds4 git revision, duration, and SHA-256 + size of both the
imatrix and the output — so a result is always traceable to its inputs.
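A manifest of that shape could be produced roughly as follows. The field names and the write_manifest signature are hypothetical; only the <out>.manifest.json naming and the kinds of information recorded come from the description above.

```python
import hashlib
import json
import os


def file_digest(path, chunk=1 << 20):
    """SHA-256 and size of a file, read in chunks so large GGUFs fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return {"sha256": h.hexdigest(), "size": os.path.getsize(path)}


def write_manifest(out_path, recipe, command, imatrix_path, ds4_rev, duration_s):
    """Write <out>.manifest.json next to the output and return its path."""
    manifest = {
        "recipe": recipe,          # resolved recipe (hypothetical field names)
        "command": command,        # exact argv that was run
        "ds4_rev": ds4_rev,        # ds4 git revision
        "duration_s": duration_s,
        "imatrix": file_digest(imatrix_path),
        "output": file_digest(out_path),
    }
    manifest_path = out_path + ".manifest.json"
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest_path
```

Comparing the output digests of two manifests built from the same recipe and imatrix is then a direct check of the determinism claim.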
python3 test_forge.py # stdlib unittest; covers the .dat parser, renderer, recipes, UI guards
forgequant is a thin orchestrator over ds4 / DwarfStar by Salvatore Sanfilippo
(antirez) — specifically gguf-tools/deepseek4-quantize,
ds4 --imatrix-dataset, ds4-server --imatrix-out and
gguf-tools/mixed/splice_mixed_expert_layers_gguf.py. All the real quantization
work is theirs; forgequant only turns a recipe into the right invocation and
records what it did. ds4 is a separate project under its own license.
MIT — see LICENSE. Does not cover ds4 or any model weights.