Why is inference slower than the non-MTP version?

#3
by NatsuGatsu - opened

I use the Q5_K_M quant of both this model and the non-MTP version. The non-MTP version gives me faster inference: this one is about 8 tokens per second slower.
My system is this:
RX 9070 16GB
32GB DDR5 6400MT/s
I used the llama.cpp-mtp-turbo-quant fork of llama.cpp

I ran this model (Q8_0) on my Strix Halo GPU and something weird showed up. Once I enabled MTP, throughput dropped from about 40 t/s to just 20 t/s, roughly half the speed. I couldn't figure out why.

The prompt processing (pp) speed also seems to decay much more steeply than with other models. I haven't run any rigorous tests, but the numbers look off: it starts at ~700 t/s, and by the time the context reaches 20k tokens the pp rate has bottomed out around 300 t/s. With the models I usually run, that kind of drop only shows up around 50k tokens. Bottom line: this model isn't behaving normally.

EDIT:
After digging deeper in actual usage, I've noticed that MTP doesn't consistently slow this model down. Instead, it sporadically speeds generation up, giving a modest 1.2-1.5× boost. The improvement is noticeable, but not as dramatic as I had hoped; perhaps it depends on the kinds of tasks I'm running. Either way, this Heretic model is definitely worth trying, and I'm grateful for llmfan46's hard work in getting it up and running!

@NatsuGatsu, @impenz and @kamjin

I redid the quants with the newest version of llama.cpp and reuploaded them. Check the new quants if you want.

Thank you

Thank you

You're welcome!

At least in my case, I noted that the base Qwen3.6 Q4_K_M or Q5_K_S with preserve_thinking true tended to fall into loops, whereas this one doesn't. This one also did better for me on some coding tests in Julia that I ran. So it seems to me that this quantization is definitely 'higher quality' or more precise than the base GGUF models I was using! Nice work and thanks.

Additional comment on MTP: I have little VRAM (11GB), so I don't see the full impact of MTP, but it still did better with MTP in my case. Q5_K_S: 23-24 t/s without MTP, 27-28 t/s with. Q4_K_M: 27-28 t/s without MTP, 31-32 t/s with (n-cpu-moe at 28, instead of the 31 used for Q5).
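
For a quick sanity check on those numbers, here is a minimal Python sketch that turns the reported ranges into an approximate speedup factor. The dictionary layout and the `midpoint` and `speedup` helpers are made up for illustration, the "31:32" figure is read as the range 31-32, and using the midpoint of each range is just one reasonable choice, not a rigorous benchmark method.

```python
# Decode rates in tokens/s as reported in the post, kept as (low, high) ranges.
# Assumption: the "31:32" figure is read as the range 31-32.
reported = {
    "Q5_K_S": {"without_mtp": (23, 24), "with_mtp": (27, 28)},
    "Q4_K_M": {"without_mtp": (27, 28), "with_mtp": (31, 32)},
}


def midpoint(rate_range):
    """Return the middle of a (low, high) range."""
    low, high = rate_range
    return (low + high) / 2


def speedup(entry):
    """Ratio of the MTP rate to the non-MTP rate, using range midpoints."""
    return midpoint(entry["with_mtp"]) / midpoint(entry["without_mtp"])


for quant, entry in reported.items():
    print(f"{quant}: about {speedup(entry):.2f}x with MTP")
```

With these inputs the ratio comes out at roughly 1.17x for Q5_K_S and 1.15x for Q4_K_M, a little below the 1.2-1.5× range mentioned earlier in the thread.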

Yes, unfortunately the Qwen3.6 family of models has looping issues from time to time when used in chat mode rather than coding/agentic mode. I tried to improve the chat_template.jinja as much as I could without breaking things. The Qwen3.6 family is supposed to be optimized for coding and agentic tasks, but from my own usage I can tell you that for chat mode the Qwen3.5 family is a lot more stable. Those models might not be as good as Qwen3.6 for agentic and coding work, but they are better for everything else.
