Skip to content

andreaborio/forgequant

Repository files navigation

forgequant

Config-driven asymmetric quantization for DeepSeek-V4-Flash.

forgequant turns a small JSON recipe — per-tensor-family quant types, an importance matrix (imatrix), and optionally a per-layer expert boost — into a quantized GGUF, reproducibly, with a manifest. It wraps ds4's deepseek4-quantize (plus ds4 --imatrix-dataset, ds4-server --imatrix-out and the GGUF splicer), so you stop hand-assembling long quantizer command lines.

The point: a model's 2-bit budget can be spent asymmetrically, at three depths —

  1. family — keep attention/shared/output near-lossless (Q8), push the routed experts to 2 bits;
  2. expert — an imatrix re-allocates those 2 bits inside every tensor toward the experts your workload actually activates (ds4 records per-(layer, expert) activation statistics);
  3. layerboost upcasts the routed experts of the layers your workload lives in (e.g. Q4_K on the 6 hottest), via --tensor-type overrides — or splice copies them from a donor GGUF in minutes, without requantizing.

forgequant makes that recipe a file you can version, diff, and re-run — and gives you the tools to see the activation paths before you spend the bits.

Specific to DeepSeek-V4-Flash and ds4's quantizer — not a general GGUF tool.

Install

Python 3.8+ (standard library only; numpy used opportunistically if present). You need a built ds4 checkout (provides ds4, ds4-server and gguf-tools/deepseek4-quantize), the FP model source, and a template GGUF.

git clone --recursive https://github.com/andreaborio/forgequant.git   # --recursive pulls benchy
# already cloned? grab the benchy submodule:
git submodule update --init

export DS4_DIR=~/BEEP/ds4               # your ds4 checkout (default ~/ds4)
export MODELS_DIR=~/BEEP/ds4-models     # models/imatrices ({models} in recipes)

benchy (the benchmark source) is vendored as a git submodule, so a recursive clone is self-contained. git submodule update --remote benchy bumps it to benchy's latest registry.

Use

python3 forgequant.py list                  # available recipes
python3 forgequant.py show coder-q4boost    # resolved recipe + EXACT commands (nothing runs)
python3 forgequant.py verify coder-q4boost  # preflight: paths, imatrix, disk space
python3 forgequant.py build coder-q4boost   # full pipeline: imatrix (if missing) -> quantize

<recipe> is a preset name (recipes/<name>.json) or a path to your own .json.

Getting an imatrix (four ways)

The imatrix is the activation-path record that steers the bits. Pick your source:

# 1. from REAL benchmarks (the questions a domain expert faces) — fetched from benchy
python3 forgequant.py build coder-q4boost     # the `bench` block builds the corpus first

# 2. from a corpus you already have (rendered prompt dataset)
python3 forgequant.py imatrix medical-iq2

# 3. from ANY raw prompt list — render it first, no ds4 python tooling needed
python3 forgequant.py render my_prompts.txt -o coder_corpus.txt   # .txt or .jsonl
python3 forgequant.py imatrix coder-q4boost

# 4. from LIVE inference — serve the model, use it for real, Ctrl-C when done
python3 forgequant.py capture coder-q4boost --port 8000

capture wraps ds4-server --imatrix-out: it records only aggregate per-expert activation statistics from your real traffic — no prompt text is ever stored (see ds4's ONEDGE_IMATRIX.md), and snapshots are written periodically. Ctrl-C is a graceful stop: forgequant waits for ds4-server to flush its final snapshot.

Calibrating on real benchmarks (via benchy)

Calibrate a domain imatrix on the questions a domain expert actually faces. Benchmarks come from benchy (github.com/andreaborio/benchy), vendored as a git submodule so everyone who clones forgequant gets the same source — a registry of real, non-saturated evals (MMLU-Pro, SuperGPQA, HumanEval, MBPP, MedXpertQA, MedQA, …) fetched live from the HuggingFace datasets-server and normalized.

git submodule update --init                  # first time: pull benchy in
python3 forgequant.py bench list             # the registry (current vs saturated)
python3 forgequant.py bench bundles          # domain bundles: code / medical / reasoning / …
python3 forgequant.py bench corpus code -o bench/corpora/code.txt --answers --mix reasoning

Or declare it in a recipe and let build do everything:

"bench": {"keys": ["humaneval","mbpp","mmlu_cs"], "answers": true, "mix": "reasoning", "cap": 400}

--answers adds the gold answer as an assistant turn (so the imatrix sees the activation paths of answering, not just reading); --mix DOMAIN interleaves a general set so a domain imatrix doesn't over-specialize. Every corpus build records its provenance (benchmark keys, row SHAs, upstream dataset commit, options) under bench/runs/ — tracked in the repo, so a calibration is always traceable to the exact benchmark snapshot. forgequant never redistributes benchmark data: rows are fetched from HF on demand.

Pinned & verifiable. forgequant talks to benchy through its stable api contract (benchy/api.py, API_VERSION), never its internals — so a benchy refactor can't break forgequant. benchy's benchmarks.lock.json pins each benchmark to an exact upstream dataset commit + content hash; fetches are verified against it, so upstream drift is detected, not silently absorbed. Inspect with python3 benchy/api.py status; accept an upstream change with python3 benchy/api.py relock <key>. If you run against an older benchy without api.py, forge_bench falls back to the legacy fetcher (unpinned) automatically.

Seeing the brain paths

python3 forgequant.py paths coder-q4boost          # per-layer/per-expert heatmap
python3 forgequant.py paths a.dat --diff b.dat     # what does CODE light up that MEDICAL doesn't?
python3 forgequant.py suggest coder-q4boost --top 6 --type q4_k   # boost proposal + size cost

paths parses the .dat directly (the format packs one importance vector per expert per routed tensor) and shows where the workload concentrates. suggest turns that into a ready-to-paste boost block with an estimated size delta. Values are count-normalized activation energy: how hard an expert works when routed; never-routed experts show as cold (zero).

Dashboard (UI)

A single-file web dashboard drives forgequant from the browser — template gallery, recipe builder (families + boost + imatrix), build/quantize/imatrix/capture/splice actions, live progress (per-tensor, ETA), an interactive brain map of any imatrix (43×256 heatmap, hot layers, diff between two imatrices, one-click boost suggestion), past-runs browser, and the table of forged models.

python3 forge_ui.py            # -> http://localhost:8060

Stdlib only; same DS4_DIR / MODELS_DIR config.

Recipe format

{
  "name": "coder-q4boost",
  "description": "...",
  "hf": "{models}/DeepSeek-V4-Flash-FP",      // FP safetensors source
  "template": "{models}/<base>.gguf",          // metadata/order/shapes; non-listed families copied from here
  "imatrix": "{models}/coder.dat",             // legacy .dat imatrix (applied per expert)
  "corpus": "{models}/coder_corpus.txt",       // optional: build the imatrix from this if missing
  "imatrix_max_tokens": 120000,
  "quant": {                                    // family -> quant type (only what you change)
    "routed_w1": "iq2_xxs",   // gate experts
    "routed_w3": "iq2_xxs",   // up   experts
    "routed_w2": "q2_k"       // down experts
  },
  "boost": {                                    // per-layer expert upcast (optional)
    "layers": "auto:6",       // N hottest layers from the imatrix — or "37-42", or [37,40]
    "type": "q4_k",
    "families": ["w1","w2","w3"]                // optional subset
  },
  "tensor_types": {"blk.0.": "q8_0"},          // raw --tensor-type prefix overrides (optional)
  "reuse": "{models}/DeepSeek-V4-Flash-coder-iq2.gguf",  // copy unchanged tensors from a prior build (optional)
  "splice": {                                   // fast layer boost without requantizing (optional)
    "donor": "{models}/<q4-variant>.gguf",
    "layers": "auto:6"
  },
  "threads": 16
}

Families: routed_w1/w2/w3 (gate/down/up experts), experts (all three), attention, attn_proj, shared, embedding, output, dense. Anything you omit is copied verbatim from template.

Producible quant types: deepseek4-quantize can only generate iq2_xxs, q2_k, q4_k, q8_0 (plus f16/bf16/f32 passthrough) — these are the ds4q_can_quantize() types in ds4's quants.c. Other names (q3_k, iq3_xxs, iq2_s, q5_k, q6_k, …) parse but the quantizer rejects them with "unsupported quant target type", so forgequant validates recipes up front.

{models} expands to $MODELS_DIR, {name} to the recipe name, ~ to your home.

Where each knob acts

Granularity Mechanism Cost to test
family (all experts) quant--routed-w1/w2/w3 full requantize
expert (within a tensor) imatrix — per-expert bit steering, automatic imatrix run
layer (chosen experts ×3 tensors) boost--tensor-type overrides requantize changed layers only with reuse
layer, instantly splice — copy from donor GGUF minutes

Per-expert types inside one fused tensor aren't possible (GGUF stores one type per tensor; verified in deepseek4-quantize.c) — the imatrix's per-expert bit steering plus layer boost is the practical equivalent.

reuse (incremental re-quantize). A boost only changes a few layers, but a plain requantize regenerates all 43 from FP. Point reuse at a prior build with the same imatrix (e.g. a coder-iq2 for a coder-q4boost) and the quantizer copies the byte-identical unchanged tensors, regenerating only the boosted ones — ~85% less work. Safe by construction: it's gated on a quantize.reuse_key (hash of the safetensors index + imatrix) plus a per-tensor type/shape match, so a mismatched or missing prior falls back to a full quantize. Needs ds4's deepseek4-quantize --reuse (this fork).

Presets

Recipe Idea For
medical-iq2 IQ2_XXS · Q2_K + medical imatrix the proven BeepMed recipe
coder-iq2 same budget, code-calibrated imatrix coding workloads
coder-q4boost coder-iq2 + Q4_K on the 6 code-hottest layers "keep my coding expert sharp"
medical-q4boost medical-iq2 + Q4_K on the 6 med-hottest layers BeepMed, higher fidelity
last6-q4boost static Q4_K on layers 37-42 ds4's proven mixed experiment
splice-fast copy hot layers from a Q4 donor fastest A/B loop
balanced Q4_K gate/up · Q2_K down bigger, higher fidelity
aggressive IQ2_XXS everywhere smallest, most lossy

A/B with benchy

Each forgequant output is a drop-in model. Serve it and measure with benchy:

ds4-server -m ~/BEEP/ds4-models/DeepSeek-V4-Flash-coder-q4boost.gguf --ssd-streaming --port 8000 &
python3 ../benchy/eval_mcq.py data/humaneval.jsonl 60 think coder-q4boost

For quick quality signals without a benchmark run, ds4's gguf-tools/quality-testing/ scores GGUF variants by NLL against official DeepSeek continuations.

Reproducibility

Quantization is deterministic: the same recipe + the same imatrix produce the same GGUF. Every quantize/splice writes <out>.manifest.json — the resolved recipe, the exact command, the ds4 git revision, duration, and SHA-256 + size of both the imatrix and the output — so a result is always traceable to its inputs.

Tests

python3 test_forge.py    # stdlib unittest; covers the .dat parser, renderer, recipes, UI guards

Built on ds4

forgequant is a thin orchestrator over ds4 / DwarfStar by Salvatore Sanfilippo (antirez) — specifically gguf-tools/deepseek4-quantize, ds4 --imatrix-dataset, ds4-server --imatrix-out and gguf-tools/mixed/splice_mixed_expert_layers_gguf.py. All the real quantization work is theirs; forgequant only turns a recipe into the right invocation and records what it did. ds4 is a separate project under its own license.

License

MIT — see LICENSE. Does not cover ds4 or any model weights.

About

Config-driven asymmetric quantization for DeepSeek-V4-Flash: a recipe (per-family quant types + imatrix) -> a quantized GGUF, with a live web dashboard. Wraps ds4.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages