Thanks and request for FP8 version

#2
by janreges3 - opened

Hi @rico03,

Thank you very much for this model. It works very well for my use cases, and even MTP with 3 tokens has a very decent acceptance ratio.

Would it be possible to prepare an FP8 version that also keeps MTP?

Also, would it be possible to create a version distilled from Opus 4.7 (the current one is based on Opus 4.6)?
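
For readers unfamiliar with the acceptance ratio mentioned above, here is a rough sketch of how it relates to speed-up with 3 draft tokens. It assumes a simplified model in which each draft token is accepted independently with the same probability and drafting stops at the first rejection. The function name and the probabilities are illustrative only, not measurements from this model.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption: each draft token is accepted independently with probability
    accept_prob, and acceptance stops at the first rejected token. The
    target model always contributes one token itself (the i = 0 term).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    return sum(accept_prob ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    # Hypothetical acceptance probabilities, 3 draft tokens as in the post.
    for p in (0.5, 0.7, 0.9):
        tokens = expected_tokens_per_step(p, num_draft=3)
        print(f"acceptance {p:.1f}: ~{tokens:.2f} tokens per step")
```

Under these assumptions the value ranges from 1 (nothing accepted) to 4 (all three draft tokens accepted), so a higher acceptance ratio translates directly into more tokens per forward pass of the main model.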

Owner

From what I know, you need the full model for MTP, but you can try to quantize it. I already have a GGUF version in another repo. Quantizing is very easy: you only need to use llama.cpp.

Regarding Opus 4.7: fine-tuning like this only adds a reasoning pattern, not more knowledge, and the ways Opus 4.6 and Opus 4.7 reason are very similar. To create a version fine-tuned on Opus 4.7 I would need a dataset of samples, but the most reliable ones are from Opus 4.6, so it would be a gamble.

Thank you for your comment!

Owner

You can also use MTP with FP8. There are two ways to do it:

  1. on-the-fly:

vllm serve rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled \
  --quantization fp8

  2. pre-quantize:

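pip install auto-fp8
python -c "
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
# Dynamic activation scales need no calibration examples
config = BaseQuantizeConfig(quant_method='fp8', activation_scheme='dynamic')
model = AutoFP8ForCausalLM.from_pretrained('rico03/Qwen3.6-27B-Claude-Opus-Reasoning-Distilled', config)
# Quantize the weights, then write the FP8 checkpoint to disk
model.quantize([])
model.save_quantized('./qwen36-fp8')
"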

Then run the quantized version.

Thank you, I'm currently using online quantization because I have a 96GB VRAM GPU, but I wrote this request for the sake of others who might not have the capacity for online quantization.

In the next few days I will try to find time to create and publish such a quantization myself :)
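
To put the capacity concern above in rough numbers, here is a back-of-the-envelope sketch of weight memory only. It assumes the "27B" in the model name means about 27 billion parameters, 2 bytes per parameter for 16-bit weights and 1 byte for FP8, and that online quantization has to hold the 16-bit weights at some point. KV cache, activations, and runtime overhead are ignored, so real requirements will be higher.

```python
def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bytes_per_param / 1e9


# Assumption: "27B" in the model name means roughly 27 billion parameters.
NUM_PARAMS = 27e9

bf16_gb = weight_gb(NUM_PARAMS, 2)  # 16-bit weights: 2 bytes per parameter
fp8_gb = weight_gb(NUM_PARAMS, 1)   # FP8 weights: 1 byte per parameter

if __name__ == "__main__":
    print(f"16-bit weights: ~{bf16_gb:.0f} GB")
    print(f"FP8 weights:    ~{fp8_gb:.0f} GB")
```

Under these assumptions the 16-bit weights fit comfortably on a 96GB GPU, while a pre-quantized FP8 checkpoint roughly halves the weight footprint for people with smaller cards.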
