Possibilities of NVFP4?

#1
by johnlaborxxx - opened

Hi, GGUF NVFP4 support was recently merged in a PR.
I wonder if you could also release NVFP4 versions of these GGUFs (qwen3.5 & 3.6 27b / gemma4 31b)?
I think NVFP4 is only a bit worse than Q6 in terms of intelligence but would be 3 times faster on Nvidia GPUs. Thanks.

Owner • edited May 2

Hi, GGUF NVFP4 support was recently merged in a PR.

Yes and no. It's not as simple a process as just creating an NVFP4 GGUF; this is actually what happens when you try to create NVFP4 with llama.cpp:

main: invalid ftype 'NVFP4'

Basically, NVFP4 is not recognized as a quantization type by the latest version of llama.cpp (from 3 hours ago).

I wonder if you could also release NVFP4 versions of these GGUFs (qwen3.5 & 3.6 27b / gemma4 31b)?

I actually spent the whole day yesterday working on this exact thing, but it's way more difficult and complicated than simply creating GGUFs with llama.cpp.

I think NVFP4 is only a bit worse than Q6 in terms of intelligence but would be 3 times faster on Nvidia GPUs. Thanks.

Yes, I will keep working on it today. I just haven't been able to find a recipe that gives you both a small size and retained quality. I was able to create one, but it's 27.5 GiB. It's the best quality I can do, but I'm not sure people will be eager to download an NVFP4 that is only about 1 GiB smaller than FP8 and GPTQ-8bit. I suspect people will see that the size is about 10 GiB bigger than expected and pass on it for a smaller NVFP4 from other uploaders, disregarding the quality tradeoffs (which makes sense, as a higher-quality version is not really useful if you cannot fit it in your hardware).
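
For a rough sense of why 27.5 GiB looks large for an NVFP4 file, here is a back-of-the-envelope sketch. It assumes a nominal 27e9 parameters, and it assumes NVFP4 stores 4-bit values in blocks of 16 with one 8-bit scale per block (about 4.5 bits per weight). The helper names are made up for illustration. Real files will differ, since some tensors are usually kept at higher precision and the true parameter count is not exactly 27e9.

```python
GIB = 2 ** 30  # bytes per GiB


def bits_per_weight(element_bits, scale_bits, block_size):
    """Effective bits per weight for a block format with one scale per block."""
    return element_bits + scale_bits / block_size


def estimate_gib(n_params, bpw):
    """Size in GiB if every parameter were stored at `bpw` bits."""
    return n_params * bpw / 8 / GIB


N_PARAMS = 27e9  # assumption: nominal parameter count, not measured

# Assumption: 4-bit elements, 8-bit scale, 16 elements per block.
nvfp4_bpw = bits_per_weight(element_bits=4, scale_bits=8, block_size=16)
nvfp4_gib = estimate_gib(N_PARAMS, nvfp4_bpw)

print(f"NVFP4 everywhere: {nvfp4_bpw} bpw, roughly {nvfp4_gib:.1f} GiB")
```

Under those assumptions a fully NVFP4 file would be far smaller than 27.5 GiB, so a file of that size must keep a large share of its tensors at higher precision.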

Owner

It took a while, but I might finally have something for you later today.

Owner

Finally done, here you go:

https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-GGUF

More NVFP4 coming.

Hi @llmfan46,
Thank you!
Just as I was about to download, I saw your new upload: https://huggingface.co/lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP
I assume there will also be a GGUF for that?

In my limited research, it seems like NVFP4 MTP gguf > NVFP4 MLP gguf > NVFP4 gguf?
Thanks!

@johnlaborxxx

Here you go:

https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP-GGUF

Wow, this is wonderful! Thanks a lot @llmfan46 !

So far I have only downloaded the MLP version. I'm going to test the difference between normal MLP and q8 MLP, as the size difference is huge.
I will try MTP once I am ready, as I doubt Kobold currently supports MTP natively.

// NextN/MTP tensors are currently ignored (reserved for future MTP support)

My search shows an MTP model would run, but the MTP layers (and their speedup) just get ignored.
Therefore, I might need to set up LM Studio to fully see the effect.

Thanks!

@johnlaborxxx

Yep, sure thing. More models are coming soon, and be sure to let me know if you need NVFP4 GGUFs for any of my other models. Have fun with the models.

Hi,

Well, since you asked @llmfan46, I will shamelessly ask whether gemma4 is also possible, at least the 31b model, since q6 takes up roughly my entire VRAM.
An MLP NVFP4, q8 or not, would definitely help free up more space for context while keeping the attention layers at a higher bit width 😍

This is NOT in any way urgent, as I can still work with the existing gemma4 heretic with 60K context.
Only do this if you have the time, resources, or interest. 👍

Thanks!

I can do it, but not right away. I need to finish benchmarking and releasing my newest Gemma 4 31B it uncensored finetune, and I also have to redo and re-upload all the Qwen3.6 and Gemma 4 GGUFs due to changes in chat templates.

Seems like a lot more to redownload then :)
May I ask a question @llmfan46: if the chat template is updated, why do the GGUFs also need to be rebuilt?
Isn't the chat template only loaded in kobold/ST/LM Studio? Or am I thinking about this wrong? Thanks.

Because the chat template is packed into the GGUF file itself. With a safetensors release it's easy to just update chat_template.jinja, but that isn't the case here.
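
To illustrate what "packed into the GGUF" means, here is a minimal sketch in plain Python. It is not llama.cpp's code. It builds a tiny GGUF-style header in memory and reads it back, assuming the usual header layout (magic, version, tensor count, key/value count) and handling only string values. `build_header` and `read_string_metadata` are hypothetical helper names, and the template text is a placeholder. `tokenizer.chat_template` is the metadata key GGUF files use for the template.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # value-type id used for strings in GGUF metadata


def write_string(buf, text):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    buf.write(struct.pack("<Q", len(data)))
    buf.write(data)


def read_string(buf):
    (length,) = struct.unpack("<Q", buf.read(8))
    return buf.read(length).decode("utf-8")


def build_header(metadata):
    """Hypothetical helper: a header with string metadata and no tensors."""
    buf = io.BytesIO()
    buf.write(GGUF_MAGIC)
    buf.write(struct.pack("<I", 3))              # format version
    buf.write(struct.pack("<Q", 0))              # tensor count
    buf.write(struct.pack("<Q", len(metadata)))  # metadata key/value count
    for key, value in metadata.items():
        write_string(buf, key)
        buf.write(struct.pack("<I", GGUF_TYPE_STRING))
        write_string(buf, value)
    return buf.getvalue()


def read_string_metadata(blob):
    """Hypothetical helper: parse the header written by build_header."""
    buf = io.BytesIO(blob)
    if buf.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    (_version,) = struct.unpack("<I", buf.read(4))
    _tensor_count, kv_count = struct.unpack("<QQ", buf.read(16))
    out = {}
    for _ in range(kv_count):
        key = read_string(buf)
        (value_type,) = struct.unpack("<I", buf.read(4))
        if value_type != GGUF_TYPE_STRING:
            raise NotImplementedError("this sketch only handles string values")
        out[key] = read_string(buf)
    return out


header = build_header({
    "general.architecture": "demo",  # placeholder value
    "tokenizer.chat_template": "{{ messages[0]['content'] }}",  # placeholder
})
meta = read_string_metadata(header)
print(meta["tokenizer.chat_template"])
```

Since the template sits in the same binary file as the weights, shipping a corrected template means producing and re-uploading a new GGUF file, whereas a safetensors repo keeps it in a separate text file.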

Seems like this is the best of both worlds:
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF

Even though kobold does not support MTP yet, I can still enjoy the MLP q8 nvfp4.
And if kobold starts to support it, then we rock!

Then I guess this can be retired?
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-GGUF
Thanks.

Seems like this is the best of both worlds:
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF

Yeah it doesn't get better than that now.

Even though kobold does not support MTP yet, I can still enjoy the MLP q8 nvfp4.
And if kobold starts to support it, then we rock!

Then I guess this can be retired?
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-GGUF
Thanks.

I'll just leave it for now in case someone doesn't care about MTP and just wants the smaller size.
