---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model: huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated
base_model_relation: quantized
tags:
- transformers
- safetensors
- qwen3_5
- quantized
- nvfp4
- fp4
- 4-bit
- compressed-tensors
- llm-compressor
- vllm
- image-text-to-text
- multimodal
- reasoning
datasets:
- neuralmagic/calibration
---

# Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-NVFP4

This repository contains an NVFP4-compressed version of [huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated).

The language model weights are compressed to NVFP4 for efficient inference on recent NVIDIA GPUs, while the multimodal weights are kept in BF16 and repacked into a separate `model-multimodal-extra.safetensors` file so that `Qwen3_5ForConditionalGeneration` behavior is preserved.

## What Was Quantized

- Source model: `huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated`
- Quantization method: `llmcompressor` one-shot NVFP4
- Calibration dataset: `neuralmagic/calibration` (`LLM` split)
- Calibration samples: `512`
- Calibration sequence length: `2048`
- Quantized targets: `Linear`
- Excluded from quantization:
  - `lm_head`
  - all `model.visual.*` linear layers
  - `linear_attn.in_proj_a`
  - `linear_attn.in_proj_b`

## Repository Layout

- `model-00001-of-00005.safetensors` to `model-00005-of-00005.safetensors`
  - NVFP4 main language-model shards
- `model-multimodal-extra.safetensors`
  - BF16 multimodal tensors preserved from the source checkpoint
- `model.safetensors.index.json`
  - combined index for the main NVFP4 shards plus multimodal extra tensors
- `processor_config.json`
  - multimodal processor config copied from the source model
- `recipe.yaml`
  - the quantization recipe used for this build

### Stored Tensor Metadata

- `total_parameters`: `16713682960`
- `total_size`: `19743450720`
- `hybrid_extra_tensor_count`: `333`
- `hybrid_extra_tensor_bytes`: `921460192`

## Serving Notes

Tested locally with:

- `vllm/vllm-openai:cu130-nightly`
- vLLM `0.17.2rc1.dev153+g39474513f`
- NVIDIA RTX 5090
- `VLLM_NVFP4_GEMM_BACKEND=marlin`

Observed behavior with reasoning enabled:

- `POST /v1/chat/completions`
  - returns `message.reasoning`
  - can also return a normal `message.content` if `max_tokens` is large enough
- `POST /v1/responses`
  - returns reasoning blocks under `output[].type = "reasoning"`
  - returns final text under `output[].type = "message"` and `content[].type = "output_text"`

For robust client integration, prefer reading the structured `responses` output instead of assuming the top-level `text` field is populated.

## Example vLLM Command

```bash
vllm serve /path/to/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-NVFP4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8
```

## Notes

- The source model's safety, licensing, and usage constraints still apply.
- This repo keeps multimodal capability by preserving the original visual tower in BF16 instead of re-quantizing it.
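
## Example Client Parsing

The Serving Notes recommend reading the structured `/v1/responses` output rather than a top-level `text` field. The following is a minimal sketch of one way a client might do that, not a definitive implementation. It works on an already-decoded JSON payload, so it needs no HTTP library. The function name `split_response` is hypothetical, and the location of the reasoning text inside a `reasoning` item (a `content` or `summary` list whose parts carry a `text` field) is an assumption that may differ between vLLM builds. The sample payload is hand-written for illustration and is not captured server output.

```python
from typing import Any


def split_response(payload: dict[str, Any]) -> tuple[str, str]:
    """Return (reasoning_text, final_text) from a /v1/responses-style payload."""
    reasoning_parts: list[str] = []
    answer_parts: list[str] = []

    for item in payload.get("output") or []:
        item_type = item.get("type")
        if item_type == "reasoning":
            # Assumption: reasoning text lives in `content` and/or `summary`
            # parts that each carry a `text` field.
            for key in ("content", "summary"):
                for part in item.get(key) or []:
                    text = part.get("text")
                    if text:
                        reasoning_parts.append(text)
        elif item_type == "message":
            # Final text is in content parts of type "output_text".
            for part in item.get("content") or []:
                if part.get("type") == "output_text":
                    answer_parts.append(part.get("text") or "")

    return "\n".join(reasoning_parts), "".join(answer_parts)


# Hand-written sample payload (illustrative only).
sample = {
    "output": [
        {
            "type": "reasoning",
            "content": [{"type": "reasoning_text", "text": "Think."}],
        },
        {
            "type": "message",
            "content": [{"type": "output_text", "text": "Hello"}],
        },
    ]
}

reasoning, answer = split_response(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

If the reasoning budget consumes all of `max_tokens`, the `message` item may be missing, in which case the second element of the returned tuple is an empty string and the client can decide whether to retry with a larger limit.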