Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated FP8 (vision-preserving)

Calibrated FP8 quantization of huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated that preserves vision-language capabilities, unlike vLLM's dynamic FP8 which destroys them.

TL;DR

If you serve this model in vLLM with --quantization fp8 (dynamic FP8), the model completely loses vision and hallucinates random descriptions for every image (we tested: image of an anime girl in a kimono → model says "bowling ball", "Chris Hemsworth in a suit", "blank white rectangle"). This checkpoint avoids that by using calibrated FP8 with the vision tower kept in BF16.

Why dynamic FP8 destroys Qwen3.5-VL vision

Dynamic --quantization fp8 in vLLM walks every Linear layer in the model and quantizes weights to FP8 (E4M3, ~256 representable values per range) using a single tensor-wide scale per layer. For pure language layers this is fine because activations sit in similar ranges. For vision-language models it is catastrophic, and here is the exact failure mode:

  1. The vision merger is the only bridge between visual encoder and LM. Qwen3.5-VL's visual tower outputs 1152-dim image features. A single Linear merger projects those features into the 5120-dim LM embedding space. If this projection is even slightly wrong, the LM gets noise instead of meaningful visual tokens.

  2. The merger has a much wider weight distribution than LM layers. It needs to encode visual patterns at many scales, so its weights span a wider range than typical LM weights.

  3. Single-scale FP8 crushes the merger. With one tensor-wide scale, important small-magnitude weights in the merger get rounded to zero. The projector outputs become noise.

  4. The LM still receives "image embeddings" at the image token positions, but they are noise. The LM has no useful image information, falls back to text-only generation, and hallucinates plausible-sounding descriptions from the text prompt alone.

  5. The Opus-distilled fine-tune amplifies the problem. The Claude 4.6 Opus distillation used text-only training data (Claude can't share image tensors), which already weakened the vision-LM connection. Dynamic FP8 finishes the job.

End result with dynamic FP8: vision is 0 percent functional. The model generates wrong-but-plausible descriptions for every image and never sees the actual content.

What this checkpoint does differently

This checkpoint uses calibrated FP8 with explicit visual-tower exclusion:

  • Per-channel weight scales instead of one global scale per layer. Each row of each weight matrix has its own scale, computed from the actual weight distribution. This preserves precision in layers with wide dynamic range.
  • Visual tower kept in BF16 via the ignore list (re:.*visual.*, re:.*vision.*).
  • Visual merger kept in BF16 (re:.*merger.*). This is the critical bridge layer.
  • lm_head kept in BF16 (always a good idea, sensitive layer).
  • Dynamic activation quantization at inference time, computed per-token, so activations get the right scale for whatever they actually contain.

The vision encoder, the merger that bridges vision and LM, and the LM head all run in BF16 exactly as in the original model. Only the body of the language model (attention and MLP linears) is FP8.

Result

  • Vision quality: ~99 percent of the BF16 original. Confirmed working on the same images that completely break dynamic FP8.
  • LM quality: ~99 percent of the BF16 original (well within benchmark noise for FP8).
  • VRAM: ~28 GB (down from ~54 GB BF16). Half the size.
  • Speed: ~2x faster than BF16 on H100/H200/B200, identical to dynamic FP8.

Quantization details

  • Tool: llmcompressor main branch
  • Scheme: FP8_DYNAMIC (per-channel weight scales, dynamic activation scales)
  • Targets: All Linear layers
  • Excluded modules:
    • lm_head
    • re:.*visual.* (entire visual tower)
    • re:.*merger.* (vision-to-LM merger)
    • re:.*vision.* (anything else vision-related)
  • Original size: ~54 GB BF16
  • FP8 size: ~28 GB

Usage with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model tacodevs/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-FP8 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.50 \
    --max-num-seqs 2 \
    --trust-remote-code

IMPORTANT: Do NOT pass --quantization fp8. The model already has its quantization config baked in via compressed-tensors. vLLM will detect it from the config and use the proper FP8 path. Passing --quantization fp8 would re-quantize the already-FP8 weights and break everything.

Credits

Downloads last month
25
Safetensors
Model size
27B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for tacodevs/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-FP8