Thanks for the quantizations, can we get MTP Qwen 3.5 397B GGUF?

#5
by tidjei43 - opened

Thanks for the quantizations! Could we get an MTP GGUF of Qwen 3.5 397B?
I ask because llama.cpp merged the MTP branch into master yesterday.

Yes, I'm looking into it. I plan to release it as a separate MTP file, similar to how mmproj is done. I'm just testing it myself first to make sure it works as expected, and so I have instructions on how to use it!

Thanks for updating this model with MTP! A bit sad that I can't fully offload the IQ1_M anymore, as when you get to these smaller quants every little bit helps accuracy!

I tested the IQ1_S, and it's producing good output at great speeds. The model is split across 1x 4090 and 3x 3090 with full GPU offload: PP = 409.23 t/s, TG = 68.35 t/s.

For comparison, I was testing the IQ1_M last night without MTP, and it was around 40 t/s TG. Still fast, but this is a welcome boost!

I'm going to test a higher quant with MTP enabled and see how that fares. Maybe I can still retain usable TG while getting good accuracy! We'll see.


EDIT

IQ4_XS with --cpu-moe in combination with --spec-draft-cpu-moe results in very usable speeds at a much more desirable quant! The lower PP was expected.

Final results for now:

IQ1_S w/ full GPU offload // PP = 409.23 t/s, TG = 68.35 t/s
IQ4_XS w/ --cpu-moe parameters // PP = 61.72 t/s, TG = 24.14 t/s


Final EDIT:

ik_llama.cpp is much better for partial CPU offload, resulting in even better IQ4_XS numbers.

ik_llama.cpp // IQ4_XS
PP = 154.67 t/s
TG = 33.70 t/s
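
To put the two IQ4_XS results side by side, here is a small sketch that computes the ik_llama.cpp speedup from the figures quoted in this thread. The `speedup` helper and the dictionary names are made up for illustration and are not part of either project.

```python
def speedup(new_tps: float, old_tps: float) -> float:
    """Return how many times faster new_tps is than old_tps."""
    if old_tps <= 0:
        raise ValueError("old_tps must be positive")
    return new_tps / old_tps


# IQ4_XS throughput reported above, in tokens per second.
llama_cpp = {"PP": 61.72, "TG": 24.14}       # llama.cpp with CPU MoE offload
ik_llama_cpp = {"PP": 154.67, "TG": 33.70}   # ik_llama.cpp

# Ratio per phase: prompt processing (PP) and token generation (TG).
ratios = {phase: speedup(ik_llama_cpp[phase], llama_cpp[phase]) for phase in llama_cpp}

for phase, ratio in ratios.items():
    print(f"{phase}: {ratio:.2f}x")
```

On these numbers that is roughly 2.5x on PP and 1.4x on TG. They come from single runs on one machine, so treat the ratios as indicative only.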
