Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated FP8 (vision-preserving)
Calibrated FP8 quantization of huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated that preserves vision-language capabilities, unlike vLLM's dynamic FP8, which destroys them.
TL;DR
If you serve this model in vLLM with --quantization fp8 (dynamic FP8), the model completely loses vision and hallucinates random descriptions for every image (we tested: image of an anime girl in a kimono → model says "bowling ball", "Chris Hemsworth in a suit", "blank white rectangle"). This checkpoint avoids that by using calibrated FP8 with the vision tower kept in BF16.
Why dynamic FP8 destroys Qwen3.5-VL vision
Dynamic --quantization fp8 in vLLM walks every Linear layer in the model and quantizes weights to FP8 (E4M3, an 8-bit format with at most 256 distinct values) using a single tensor-wide scale per layer. For pure language layers this is fine because activations sit in similar ranges. For vision-language models it is catastrophic, and here is the exact failure mode:
The vision merger is the only bridge between visual encoder and LM. Qwen3.5-VL's visual tower outputs 1152-dim image features. A single Linear merger projects those features into the 5120-dim LM embedding space. If this projection is even slightly wrong, the LM gets noise instead of meaningful visual tokens.
The merger has a much wider weight distribution than LM layers. It needs to encode visual patterns at many scales, so its weights span a wider range than typical LM weights.
Single-scale FP8 crushes the merger. With one tensor-wide scale, important small-magnitude weights in the merger get rounded to zero. The projector outputs become noise.
The LM still receives "image embeddings" at the image token positions, but they are noise. The LM has no useful image information, falls back to text-only generation, and hallucinates plausible-sounding descriptions from the text prompt alone.
The Opus-distilled fine-tune amplifies the problem. The Claude 4.6 Opus distillation used text-only training data (Claude can't share image tensors), which already weakened the vision-LM connection. Dynamic FP8 finishes the job.
End result with dynamic FP8: vision is 0 percent functional. The model generates wrong-but-plausible descriptions for every image and never sees the actual content.
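The rounding-to-zero effect described above can be reproduced in a small pure-Python simulation. This is only a sketch under stated assumptions: `round_to_e4m3` is a hand-written approximation of E4M3 rounding (3 mantissa bits, maximum 448, subnormal step 2^-9), and the toy 2x4 weight matrix is invented for illustration, not taken from the real merger. It compares one tensor-wide scale against one scale per row.

```python
import math

E4M3_MAX = 448.0                 # largest finite E4M3 value
E4M3_MIN_NORMAL = 2.0 ** -6      # smallest normal E4M3 value
E4M3_SUBNORMAL_STEP = 2.0 ** -9  # spacing of subnormal E4M3 values


def round_to_e4m3(x):
    """Round x to the nearest value on a simulated E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    if mag < E4M3_MIN_NORMAL:
        step = E4M3_SUBNORMAL_STEP
    else:
        _, exp = math.frexp(mag)   # mag = m * 2**exp with 0.5 <= m < 1
        step = 2.0 ** (exp - 4)    # 3 mantissa bits: 8 steps per power of two
    return math.copysign(round(mag / step) * step, x)


def fake_quantize(values, scale):
    """Quantize to the simulated FP8 grid and dequantize back to float."""
    return [round_to_e4m3(v / scale) * scale for v in values]


def per_tensor(matrix):
    """One scale for the whole matrix."""
    scale = max(abs(v) for row in matrix for v in row) / E4M3_MAX
    return [fake_quantize(row, scale) for row in matrix]


def per_channel(matrix):
    """One scale per row (output channel)."""
    return [fake_quantize(row, max(abs(v) for v in row) / E4M3_MAX)
            for row in matrix]


# Hypothetical toy weights: one large-magnitude row, one small-magnitude row.
weights = [
    [8.0, -6.5, 3.25, 0.5],
    [1e-5, -8e-6, 4e-6, 2e-6],
]

pt = per_tensor(weights)
pc = per_channel(weights)
print("per-tensor, small row: ", pt[1])
print("per-channel, small row:", pc[1])
```

In this toy case the large row sets the tensor-wide scale, so every entry of the small row falls below half the smallest representable step and collapses to zero. With a per-row scale, the same entries survive with a relative error of a few percent.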
What this checkpoint does differently
This checkpoint uses calibrated FP8 with explicit visual-tower exclusion:
- Per-channel weight scales instead of one global scale per layer. Each row of each weight matrix has its own scale, computed from the actual weight distribution. This preserves precision in layers with wide dynamic range.
- Visual tower kept in BF16 via the
ignorelist (re:.*visual.*,re:.*vision.*). - Visual merger kept in BF16 (
re:.*merger.*). This is the critical bridge layer. - lm_head kept in BF16 (always a good idea, sensitive layer).
- Dynamic activation quantization at inference time, computed per-token, so activations get the right scale for whatever they actually contain.
The vision encoder, the merger that bridges vision and LM, and the LM head all run in BF16 exactly as in the original model. Only the body of the language model (attention and MLP linears) is FP8.
Result
- Vision quality: ~99 percent of the BF16 original. Confirmed working on the same images that completely break dynamic FP8.
- LM quality: ~99 percent of the BF16 original (well within benchmark noise for FP8).
- VRAM: ~28 GB (down from ~54 GB in BF16), roughly half the size.
- Speed: ~2x faster than BF16 on H100/H200/B200, identical to dynamic FP8.
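The VRAM figures above are consistent with a back-of-the-envelope estimate. The sketch below assumes a nominal 27 billion parameters (taken from the model name, not an exact count) and counts weight bytes only, ignoring KV cache and runtime overhead.

```python
# Nominal parameter count taken from the "27B" in the model name (assumption).
NOMINAL_PARAMS = 27e9

BYTES_PER_BF16 = 2
BYTES_PER_FP8 = 1

bf16_gb = NOMINAL_PARAMS * BYTES_PER_BF16 / 1e9
fp8_floor_gb = NOMINAL_PARAMS * BYTES_PER_FP8 / 1e9

print(f"BF16 weights:         ~{bf16_gb:.0f} GB")
print(f"All-FP8 lower bound:  ~{fp8_floor_gb:.0f} GB")
```

The reported ~28 GB sits slightly above the all-FP8 lower bound, which is what one would expect when the visual tower, merger, and lm_head stay in BF16 and per-channel scales are stored alongside the weights.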
Quantization details
- Tool: llmcompressor main branch
- Scheme: FP8_DYNAMIC (per-channel weight scales, dynamic activation scales)
- Targets: All Linear layers
- Excluded modules:
  - lm_head
  - re:.*visual.* (entire visual tower)
  - re:.*merger.* (vision-to-LM merger)
  - re:.*vision.* (anything else vision-related)
- Original size: ~54 GB BF16
- FP8 size: ~28 GB
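To see which modules the exclusion list above would skip, the matching can be sketched in plain Python. This is a simplified illustration, not the quantization tool's own matching code: it assumes that entries prefixed with re: are regular expressions matched against the full module name and that other entries must match exactly. The module names in the example are hypothetical.

```python
import re

# Exclusion list as given in the quantization details.
IGNORE = ["lm_head", "re:.*visual.*", "re:.*merger.*", "re:.*vision.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name would stay in BF16 under this simplified rule."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Regex entry: match against the full dotted module name.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            # Plain entry: exact name match.
            return True
    return False


# Hypothetical module names, for illustration only.
example_modules = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.visual.merger.linear_fc1",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
]

for name in example_modules:
    precision = "BF16 (ignored)" if is_ignored(name) else "FP8"
    print(f"{name}: {precision}")
```

Under these assumptions only the language-model attention and MLP projections are quantized, which matches the description of the checkpoint.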
Usage with vLLM
python -m vllm.entrypoints.openai.api_server \
--model tacodevs/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-FP8 \
--max-model-len 16384 \
--gpu-memory-utilization 0.50 \
--max-num-seqs 2 \
--trust-remote-code
IMPORTANT: Do NOT pass --quantization fp8. The checkpoint already carries its quantization config in compressed-tensors format, so vLLM detects it from the config and uses the proper FP8 path. Passing --quantization fp8 would re-quantize the already-FP8 weights and break the model.
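Once the server is running, vision can be checked through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default port 8000 (the command above does not set --port), and the helper names `build_vision_request` and `send`, the max_tokens value, and the example image URL are placeholders introduced here for illustration.

```python
import json
import urllib.request

MODEL = "tacodevs/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-FP8"


def build_vision_request(prompt, image_url, model=MODEL, max_tokens=128):
    """Build a chat completions payload with one text part and one image part."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder image URL; replace with a real, reachable image.
payload = build_vision_request(
    "Describe this image in one sentence.",
    "https://example.com/image.jpg",
)
print(json.dumps(payload, indent=2))
# With the server running: print(send(payload))
```

Comparing the reply for a few known images against their actual content is a quick way to confirm that the vision path survived quantization.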
Credits
- Original model: huihui-ai
- Base model: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
- Quantization: tacodevs
Model tree for tacodevs/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-FP8
Base model
Qwen/Qwen3.5-27B