Clamp MTP draft depth to the prefill capacity #381

Open: pandysp wants to merge 1 commit into antirez:main from pandysp:fix-mtp-draft-prefill-cap

pandysp (Contributor) commented Jun 10, 2026

--mtp-draft 2 with --prefill-chunk 1 (or DS4_METAL_PREFILL_CHUNK=1) aborts generation a few tokens in:

./ds4 -m ds4flash.gguf --mtp <mtp.gguf> --mtp-draft 2 --prefill-chunk 1 --temp 0 -n 64 -p "..."
ds4: decode failed: MTP verifier failed

The speculative path hands the batched verifier more rows than the prefill scratch holds. metal_graph_verify_suffix_tops rejects n_tokens > prefill_cap up front, and at depth 2 there is no frontier snapshot to fall back on, so the capacity rejection is treated as a fatal verifier failure. Plain decode at chunk 1 is fine, --mtp-draft 3 survives through the depth>2 snapshot fallback, and strict mode falls back cleanly; only the default draft-2 configuration dies.
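
The failure path described above can be modeled roughly as follows. This is a toy sketch and not the ds4 source: `toy_outcome`, `toy_batched_verify`, and `toy_verify_cycle` are hypothetical names, and the assumption that a depth-d draft hands d rows to the verifier is inferred from the description (depth 1 uses a 1-row verify). Strict mode is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of one speculative verify cycle. */
typedef enum {
    TOY_VERIFY_BATCHED,    /* batched verifier accepted the rows */
    TOY_VERIFY_SEQUENTIAL, /* recovered through the sequential verifier */
    TOY_VERIFY_FATAL       /* surfaced as "MTP verifier failed" */
} toy_outcome;

/* Stand-in for the up-front capacity check: reject more rows
 * than the prefill scratch can hold. */
static bool toy_batched_verify(int n_tokens, int prefill_cap) {
    return n_tokens <= prefill_cap;
}

/* Assumption: a draft of depth d needs d verifier rows. */
static toy_outcome toy_verify_cycle(int draft_depth, int prefill_cap) {
    assert(draft_depth >= 1 && prefill_cap >= 1);
    if (toy_batched_verify(draft_depth, prefill_cap))
        return TOY_VERIFY_BATCHED;
    /* Per the description, only depth > 2 has a frontier snapshot
     * to fall back on; at depth 2 the rejection is fatal. */
    if (draft_depth > 2)
        return TOY_VERIFY_SEQUENTIAL;
    return TOY_VERIFY_FATAL;
}
```

In this model, `toy_verify_cycle(2, 1)` is the only one of the described configurations that ends in `TOY_VERIFY_FATAL`; `toy_verify_cycle(3, 1)` degrades to the sequential path instead.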

The fix is a one-line clamp of the draft budget to prefill_cap, next to the existing clamps, so a small prefill window lowers the speculation depth instead of aborting. At chunk 1 that leaves depth-1 speculation through a 1-row verify, and the repro above generates byte-identical output to plain decode. It also helps deeper drafts: with a window smaller than the draft depth they currently attempt the batched verify, fail its capacity check, and fall back to the sequential verifier on every cycle; the clamp sizes the verify to fit.
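
As an illustration of the clamp, here is a minimal sketch. The helper name `clamp_draft_depth` and its parameters are hypothetical; the actual change is a single line next to the existing clamps in ds4 and may look different.

```c
#include <assert.h>

/* Hypothetical helper: cap the draft budget at the number of rows
 * the prefill scratch can hold, so the batched verify always fits. */
static int clamp_draft_depth(int requested_depth, int prefill_cap) {
    assert(requested_depth >= 1 && prefill_cap >= 1);
    return requested_depth > prefill_cap ? prefill_cap : requested_depth;
}
```

With a prefill window of 1, a requested depth of 2 becomes 1, which matches the depth-1 speculation described above. With any window at least as large as the requested depth, the function returns its input unchanged, which is the sense in which the clamp is inert.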

The alternative would be routing the capacity rejection to the sequential fallback, but that needs the verifier to distinguish a rejection before any GPU work from a failure mid-flight; the clamp removes the failure class instead.
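
For comparison, the rejected alternative would need something like the status split sketched below. All names here (`alt_verify_status`, `alt_batched_verify`, `alt_can_fall_back_sequentially`) are hypothetical, and the `gpu_ok` flag only simulates whether the GPU work succeeds; none of this is taken from the ds4 source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes separating the two failure kinds. */
typedef enum {
    ALT_VERIFY_OK,
    ALT_VERIFY_REJECTED_CAPACITY, /* refused before any GPU work */
    ALT_VERIFY_FAILED_MIDFLIGHT   /* failed after GPU work began */
} alt_verify_status;

/* gpu_ok simulates the result of the GPU work once it has started. */
static alt_verify_status alt_batched_verify(int n_tokens, int prefill_cap,
                                            bool gpu_ok) {
    assert(n_tokens >= 1 && prefill_cap >= 1);
    if (n_tokens > prefill_cap)
        return ALT_VERIFY_REJECTED_CAPACITY;
    return gpu_ok ? ALT_VERIFY_OK : ALT_VERIFY_FAILED_MIDFLIGHT;
}

/* Assumption: only a rejection issued before any GPU work is safe
 * to retry through the sequential verifier. */
static bool alt_can_fall_back_sequentially(alt_verify_status status) {
    return status == ALT_VERIFY_REJECTED_CAPACITY;
}
```

Every caller would then have to branch on the extra status, whereas clamping the depth up front means the capacity rejection never occurs on this path.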

Verified on Metal: the repro above plus ./ds4_test --mtp-verify-depth on the default chunk, where the clamp is inert.

Found while reviewing #371; its continuous gate carries the same prefill_cap check for the same reason.
