---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
base_model_relation: quantized
tags:
- transformers
- safetensors
- qwen3_5
- quantized
- nvfp4
- fp4
- 4-bit
- compressed-tensors
- llm-compressor
- vllm
- image-text-to-text
- multimodal
- conversational
datasets:
- neuralmagic/calibration
---

# Huihui-Qwen3.5-27B-abliterated-NVFP4

This repository contains an NVFP4-compressed version of [huihui-ai/Huihui-Qwen3.5-27B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-abliterated).

The goal of this build is different from a text-only repack: preserve the original model's multimodal behavior while compressing the language model weights to NVFP4 for efficient inference on recent NVIDIA GPUs.

## Source And References

- Source model: [huihui-ai/Huihui-Qwen3.5-27B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-abliterated)
- Reference quantization release: [Kbenkhaled/Qwen3.5-27B-NVFP4](https://huggingface.co/Kbenkhaled/Qwen3.5-27B-NVFP4)
- Quantization library: [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
- vLLM patch/runtime reference: [Li-Lee/vllm-qwen3.5-nvfp4-5090](https://github.com/Li-Lee/vllm-qwen3.5-nvfp4-5090)

This model follows the same NVFP4 recipe family used in the reference release, adapted to the Huihui abliterated checkpoint and repacked to keep the multimodal components required by `Qwen3_5ForConditionalGeneration`.

## What Was Quantized

- `language_model` `Linear` layers were quantized to NVFP4 with `llm-compressor`.
- Calibration used `neuralmagic/calibration`, `512` samples, sequence length `2048`.
- Vision tower, multimodal merger, and other non-quantized multimodal weights remain in BF16 so image and video pathways are preserved.
- The ignore list also excludes `linear_attn.in_proj_a` and `linear_attn.in_proj_b`, matching the stability constraints observed during quantization on this checkpoint.

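
The selection rules in the list above can be illustrated with a small, dependency-free sketch. It is not the recipe used for this release and not `llm-compressor`'s own matching code: it only mimics the `re:`-prefixed ignore-pattern convention, and every pattern and module name in it is a hypothetical stand-in (including the assumption that the vision tower and merger live under a `visual` prefix).

```python
import re

# Hypothetical ignore patterns modelled on the rules described above.
# The "re:" prefix marks a regular expression; anything else is an exact name.
IGNORE_PATTERNS = [
    "re:.*visual.*",                   # assumed prefix for vision tower and merger
    r"re:.*linear_attn\.in_proj_a$",   # excluded for stability
    r"re:.*linear_attn\.in_proj_b$",   # excluded for stability
]

# Hypothetical module names, for illustration only.
EXAMPLE_MODULES = [
    "model.language_model.layers.0.mlp.gate_proj",
    "model.language_model.layers.0.linear_attn.out_proj",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.0.linear_attn.in_proj_b",
    "model.visual.blocks.0.attn.qkv",
    "model.visual.merger.linear_fc1",
]


def is_ignored(module_name, patterns=IGNORE_PATTERNS):
    """Return True if the module would be left unquantized (kept in BF16)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


def split_modules(names, patterns=IGNORE_PATTERNS):
    """Split module names into (quantized, kept_in_bf16), preserving order."""
    quantized = [n for n in names if not is_ignored(n, patterns)]
    kept = [n for n in names if is_ignored(n, patterns)]
    return quantized, kept


if __name__ == "__main__":
    quantized, kept = split_modules(EXAMPLE_MODULES)
    print("NVFP4:", quantized)
    print("BF16 :", kept)
```

Running the same kind of check against the real module names of a checkpoint is a quick way to confirm that an ignore list leaves the multimodal weights untouched before starting a long calibration run.
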

The extra file `model-multimodal-extra.safetensors` stores the preserved multimodal tensors that are not part of the compressed language-model shards.

## Why The Hub May Show About 17B Instead Of 27B

Hugging Face derives the displayed parameter count for compressed safetensor repos from the stored tensor payloads, not from the original dense model's logical parameter count.

For this repository, the Hub API reports roughly `16.7B` stored elements across `U8`, `F8_E4M3`, `BF16`, and `F32` tensors because the NVFP4-compressed language-model weights are packed. The underlying source model is still `huihui-ai/Huihui-Qwen3.5-27B-abliterated`, and this release is intended as its NVFP4-compressed multimodal variant rather than a separate native `17B` architecture.

## Inference

As tested locally, this model works with a custom NVFP4-capable vLLM build for RTX 5090 based on the patch repository above.

A representative serve command is:

```bash
vllm serve lyf/Huihui-Qwen3.5-27B-abliterated-NVFP4 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching
```

Depending on your vLLM build, you may also need the same NVFP4 runtime flags used by the reference repository.

## Evaluation

Evaluation numbers are intentionally omitted for now.


The current local `vLLM + OpenAI-compatible API + RTX 5090` path is suitable for serving and basic smoke testing, but it does not yet reproduce the reference evaluation setup cleanly enough for publication-quality benchmark numbers:

- the original `leaderboard_gpqa_diamond` path understated quality because it used a `2047` token limit before the local harness wrapper was fixed
- the current `chat-completions` path on this RTX 5090 / patched vLLM stack can emit reasoning in a separate field while leaving `message.content` empty when `thinking` is enabled, which makes benchmark parity with the reference release unreliable

Benchmarks will be added back once GPQA Diamond, IFEval, and MMLU-Redux have been rerun with a reproducible configuration that matches the intended evaluation protocol.

## Notes

- Architecture: `Qwen3_5ForConditionalGeneration`
- Pipeline tag: `image-text-to-text`
- Quantization format: `nvfp4-pack-quantized`
- Repository layout intentionally includes `processor_config.json`, `preprocessor_config.json`, and `video_preprocessor_config.json` so multimodal preprocessing remains available.
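
The empty `message.content` behavior described in the Evaluation section can be screened for before any scoring is attempted. The following is a minimal sketch, assuming the response has already been parsed into a Python dict in the usual chat-completions shape. The reasoning field names `reasoning_content` and `reasoning` are assumptions about the serving stack and may differ in your build; `extract_answer` and `is_scorable` are hypothetical helper names.

```python
def extract_answer(response):
    """Return (answer, reasoning) from a parsed chat-completions response dict.

    Both values are stripped strings; a missing or null field becomes "".
    """
    message = response["choices"][0]["message"]
    answer = (message.get("content") or "").strip()
    # Assumed field names for the separately emitted reasoning text.
    reasoning = (
        message.get("reasoning_content") or message.get("reasoning") or ""
    ).strip()
    return answer, reasoning


def is_scorable(response):
    """True only if the final answer field is non-empty."""
    answer, _ = extract_answer(response)
    return bool(answer)


if __name__ == "__main__":
    # Hand-written example payloads, not real model output.
    with_answer = {
        "choices": [{"message": {"content": " B ", "reasoning_content": "steps"}}]
    }
    reasoning_only = {
        "choices": [{"message": {"content": None, "reasoning_content": "steps"}}]
    }
    print(is_scorable(with_answer), is_scorable(reasoning_only))
```

Counting how many responses fail `is_scorable` gives a rough indication of whether a benchmark run was affected by the issue, without treating the reasoning text as a substitute answer.
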