Clamp MTP draft depth to the prefill capacity #381
Open
pandysp wants to merge 1 commit into
`--mtp-draft 2` with `--prefill-chunk 1` (or `DS4_METAL_PREFILL_CHUNK=1`) aborts generation a few tokens in. The speculative path hands the batched verifier more rows than the prefill scratch holds: `metal_graph_verify_suffix_tops` rejects `n_tokens > prefill_cap` up front, and at depth 2 there is no frontier snapshot to fall back on, so the capacity rejection is treated as a fatal verifier failure. Plain decode at chunk 1 is fine, `--mtp-draft 3` survives through the depth>2 snapshot fallback, and strict mode falls back cleanly; only the default draft-2 configuration dies.

The fix is a one-line clamp of the draft budget to `prefill_cap`, next to the existing clamps, so a small prefill window lowers the speculation depth instead of aborting. At chunk 1 that leaves depth-1 speculation through a 1-row verify, and the repro above generates byte-identical output to plain decode. It also helps deeper drafts: with a window smaller than the draft depth they currently attempt the batched verify, fail its capacity check, and fall back to the sequential verifier on every cycle; the clamp sizes the verify to fit.

The alternative would be routing the capacity rejection to the sequential fallback, but that needs the verifier to distinguish a rejection before any GPU work from a failure mid-flight; the clamp removes the failure class instead.
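For readers without the diff open, the shape of the clamp can be sketched roughly as below. This is an illustration only, not the patched code: the helper names `batched_verify_fits` and `clamp_draft_depth` are hypothetical, and the sketch assumes `prefill_cap >= 1` and that a depth-d draft asks the batched verifier for d rows, as the 1-row verify at chunk 1 suggests.

```c
#include <stdbool.h>

/* Mirrors the capacity check described above: the batched verifier
 * rejects a request with more rows than the prefill scratch holds.
 * Hypothetical helper, not a function from the codebase. */
static bool batched_verify_fits(int n_tokens, int prefill_cap) {
    return n_tokens <= prefill_cap;
}

/* Hypothetical clamp: lower the draft depth to the prefill capacity
 * so the verify request always fits.
 * Assumes prefill_cap >= 1 and one verifier row per draft token. */
static int clamp_draft_depth(int draft_depth, int prefill_cap) {
    return draft_depth > prefill_cap ? prefill_cap : draft_depth;
}
```

Under these assumptions, a requested depth of 2 with a capacity of 1 becomes depth 1, and any depth that already fits within the capacity is returned unchanged, which matches the clamp being inert on the default chunk.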
Verified on Metal: the repro above plus `./ds4_test --mtp-verify-depth` on the default chunk, where the clamp is inert.

Found while reviewing #371; its continuous gate carries the same `prefill_cap` check for the same reason.