Thanks and request for FP8 version

by janreges3 - opened Apr 27

Apr 27

thank you very much for this model, it works very well for my use-cases and even MTP with 3 tokens has a very decent acceptance ratio.

But would it be possible to prepare an FP8 version, which would also keep MTP?

Also, would it be possible to create a version distilled from Opus 4.7 (current is based on Opus 4.6)?

rico03

Owner Apr 27

From what I know you need the full model for MTP, but you can try to quantize it. I've already a GGUF version in another repo. It's very easy to quantize, you have only to use llama.

For opus4.7: fine tuning like this one only add a reasoning pattern and not more knowledge. So the way in which opus4.6 and opau4.7 reason are very similar. To create a fine tuned version with opus4.7 I'll need a dataset of samples, but the most reliable are with opus4.6,so it will be a bet.

Thank you for your comment!

rico03

Owner Apr 27

You can use MTP also with fp8. You can do it in two way:

on-the-fly:

vllm serve rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled
--dtype fp8
--quantization fp8

pre-quantize

pip install auto-fp8
python -c "
from auto_fp8 import AutoFP8ForCausalLM
model = AutoFP8ForCausalLM.from_pretrained('rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled')
model.save_pretrained('./qwen36-fp8')
"

And then run the quantize version

janreges3

Apr 27

Thank you, I'm currently using online quantization because I have a 96GB VRAM GPU, but I wrote this request for the sake of others who might not have the capacity for online quantization.

In the next few days I will try to find time to create and publish such a quantization myself :)

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment