Cannot use MTP

#6
by nickmok - opened

When I run it with llama server, it shows

WARN [              load_model] WARNING: MTP speculative stage requested, but model has 0 NextN layers. MTP will be disabled.
 | tid="124161080483840" timestamp=1779601638

and it is running very slowly.

This repo does not have MTP quants.
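One way to check a given file yourself is to list its tensor names and look for MTP/NextN entries. The sketch below assumes the `gguf` Python package (`pip install gguf`) and assumes that MTP tensors carry `nextn` somewhere in their names; that naming is a guess based on the "NextN layers" wording in the warning, so adjust the marker if your build uses something else. The function names are made up for illustration.

```python
import sys
from typing import Iterable, List


def find_nextn_tensors(tensor_names: Iterable[str], marker: str = "nextn") -> List[str]:
    """Return the tensor names that contain the marker substring.

    The default marker "nextn" is an assumption about how MTP tensors
    are named, not something confirmed in this thread.
    """
    return [name for name in tensor_names if marker in name.lower()]


def gguf_tensor_names(path: str) -> List[str]:
    """Read all tensor names from a GGUF file using the gguf package."""
    # Imported lazily so the filter above works without gguf installed.
    from gguf import GGUFReader

    reader = GGUFReader(path)
    return [tensor.name for tensor in reader.tensors]


if __name__ == "__main__" and len(sys.argv) > 1:
    hits = find_nextn_tensors(gguf_tensor_names(sys.argv[1]))
    print(f"{len(hits)} tensor(s) matching 'nextn'")
    for name in hits:
        print(" ", name)
```

Run it with the path to the GGUF as the only argument. A count of zero would be consistent with the "0 NextN layers" warning above.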

Maybe it should? From my limited research, it seems the Qwen3.6-40B base model does not have it.

The issue was related to Transformers / llama.cpp at the time this model was built.
MTP tensors/layers were stripped out or ignored.

The Heretic base, the 40B base, the training and the GGUFs do not have MTP layers/tensors.
To do it properly, the entire training would have to be redone and the quants regenerated.

It is unclear at this time whether Transformers has been updated/fixed to address the MTP issues.
Other pipeline issues that affect MTP are also being addressed.

You cannot simply put "untrained" MTP tensors back in.

@DavidAU I actually just implemented this. I injected the MTP head from the base Qwen3.6-27B (BF16 precision, all 15 tensors) into the 40B Q6_K GGUF. The hidden dimensions match (5120) since the expansion only added depth, so the tensors are dimensionally compatible without retraining.
Results: 68-72% acceptance rate depending on context length, ~40% speedup on generation (56 t/s vs ~40 t/s baseline) on an RTX PRO 6000.
Published here: https://huggingface.co/PiehSoft/Qwen3.6-40B-Deckard-MTP-Q6_K
You're right that the MTP head wasn't trained on the 40B's hidden states, but because the expansion preserved the hidden dimension, the untrained donor head still projects well enough to be useful. Self-distillation would push it higher, but 72% on fresh context is already above the typical threshold for net positive speculative decoding.
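For anyone who wants to sanity-check these figures, here is a rough back-of-the-envelope sketch. It assumes the MTP head drafts one token per step and that each drafted token is accepted independently with the same probability; neither assumption is stated above, so treat the output as an estimate rather than a measurement. The function names are illustrative only.

```python
def speedup(with_mtp_tps: float, baseline_tps: float) -> float:
    """Relative generation speedup, e.g. 0.4 means 40% faster."""
    return with_mtp_tps / baseline_tps - 1.0


def expected_tokens_per_step(acceptance: float, draft_len: int = 1) -> float:
    """Expected tokens emitted per verification pass.

    Assumes every drafted token is accepted independently with
    probability `acceptance` (a simplification), and that drafting
    stops at the first rejection.
    """
    if acceptance >= 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


if __name__ == "__main__":
    print(f"measured speedup: {speedup(56, 40):.0%}")
    for rate in (0.68, 0.72):
        tokens = expected_tokens_per_step(rate)
        print(f"acceptance {rate:.0%}: about {tokens:.2f} tokens per step")
```

Under those assumptions, 68-72% acceptance gives about 1.68-1.72 tokens per verification pass, which is an upper bound on the speedup if drafting were free. The measured ~40% being lower than that would be consistent with the extra cost of running the MTP head and verifying its drafts.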

@WTPieh

Excellent work and results, thank you!
