---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
base_model: huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated
base_model_relation: quantized
tags:
  - transformers
  - safetensors
  - qwen3_5
  - quantized
  - nvfp4
  - fp4
  - 4-bit
  - compressed-tensors
  - llm-compressor
  - vllm
  - image-text-to-text
  - multimodal
  - reasoning
datasets:
  - neuralmagic/calibration
---

# Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-NVFP4

This repository contains an NVFP4-compressed version of [huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated).

The language model weights are compressed to NVFP4 for efficient inference on recent NVIDIA GPUs, while the multimodal weights are kept in BF16 and repacked into a separate `model-multimodal-extra.safetensors` file so that `Qwen3_5ForConditionalGeneration` behavior is preserved.

## What Was Quantized

- Source model: `huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated`
- Quantization method: `llmcompressor` one-shot NVFP4
- Calibration dataset: `neuralmagic/calibration` (`LLM` split)
- Calibration samples: `512`
- Calibration sequence length: `2048`
- Quantized targets: `Linear`
- Excluded from quantization:
  - `lm_head`
  - all `model.visual.*` linear layers
  - `linear_attn.in_proj_a`
  - `linear_attn.in_proj_b`

## Repository Layout

- `model-00001-of-00005.safetensors` to `model-00005-of-00005.safetensors`
  - NVFP4 main language-model shards
- `model-multimodal-extra.safetensors`
  - BF16 multimodal tensors preserved from the source checkpoint
- `model.safetensors.index.json`
  - combined index for the main NVFP4 shards plus multimodal extra tensors
- `processor_config.json`
  - multimodal processor config copied from the source model
- `recipe.yaml`
  - the quantization recipe used for this build

### Stored Tensor Metadata

- `total_parameters`: `16713682960`
- `total_size`: `19743450720`
- `hybrid_extra_tensor_count`: `333`
- `hybrid_extra_tensor_bytes`: `921460192`

## Serving Notes

Tested locally with:

- `vllm/vllm-openai:cu130-nightly`
- vLLM `0.17.2rc1.dev153+g39474513f`
- NVIDIA RTX 5090
- `VLLM_NVFP4_GEMM_BACKEND=marlin`

Observed behavior with reasoning enabled:

- `POST /v1/chat/completions`
  - returns `message.reasoning`
  - can also return a normal `message.content` if `max_tokens` is large enough
- `POST /v1/responses`
  - returns reasoning blocks under `output[].type = "reasoning"`
  - returns final text under `output[].type = "message"` and `content[].type = "output_text"`

For robust client integration, prefer reading the structured `responses` output instead of assuming the top-level `text` field is populated.

## Example vLLM Command

```bash
vllm serve /path/to/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-NVFP4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8
```

## Notes

- The source model's safety, licensing, and usage constraints still apply.
- This repo keeps multimodal capability by preserving the original visual tower in BF16 instead of re-quantizing it.