Huihui-Qwen3.5-27B-abliterated-NVFP4
This repository contains an NVFP4-compressed version of huihui-ai/Huihui-Qwen3.5-27B-abliterated.
Unlike a text-only repack, this build aims to preserve the original model's multimodal behavior while compressing the language model weights to NVFP4 for efficient inference on recent NVIDIA GPUs.
Source And References
- Source model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
- Reference quantization release: Kbenkhaled/Qwen3.5-27B-NVFP4
- Quantization library: vllm-project/llm-compressor
- vLLM patch/runtime reference: Li-Lee/vllm-qwen3.5-nvfp4-5090
This model follows the same NVFP4 recipe family used in the reference release, adapted to the Huihui abliterated checkpoint and repacked to keep the multimodal components required by Qwen3_5ForConditionalGeneration.
What Was Quantized
- language_model Linear layers were quantized to NVFP4 with llm-compressor.
- Calibration used neuralmagic/calibration, 512 samples, sequence length 2048.
- Vision tower, multimodal merger, and other non-quantized multimodal weights remain in BF16 so image and video pathways are preserved.
- The ignore list also excludes linear_attn.in_proj_a and linear_attn.in_proj_b, matching the stability constraints observed during quantization on this checkpoint.
The extra file model-multimodal-extra.safetensors stores the preserved multimodal tensors that are not part of the compressed language-model shards.
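The quantization scope described in this section can be sketched as a small selection rule. This is only an illustration: the module names below are hypothetical examples, and the actual ignore list used for this release is not reproduced here. The only rules taken from the text are that Linear layers under language_model are quantized and that linear_attn.in_proj_a and linear_attn.in_proj_b are excluded.

```python
import re

# Assumption: regexes standing in for the ignore list. Only the two
# linear_attn projections are named in the text; nothing else is added.
IGNORE_PATTERNS = [
    r".*linear_attn\.in_proj_a",
    r".*linear_attn\.in_proj_b",
]


def is_quantized(module_name: str, is_linear: bool) -> bool:
    """Return True if a module would be NVFP4-quantized under the rules above."""
    if not is_linear:
        return False  # only Linear layers are quantization targets
    if "language_model" not in module_name:
        return False  # vision tower, merger, etc. stay in BF16
    return not any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)


# Hypothetical module names, used only to exercise the rule.
examples = [
    ("model.language_model.layers.0.mlp.gate_proj", True),
    ("model.language_model.layers.0.linear_attn.in_proj_a", True),
    ("model.language_model.layers.0.linear_attn.in_proj_b", True),
    ("model.visual.blocks.0.attn.qkv", True),
    ("model.language_model.layers.0.input_layernorm", False),
]

for name, is_linear in examples:
    print(f"{name}: quantized={is_quantized(name, is_linear)}")
```

Under these assumptions only the first example is quantized; the two linear_attn projections, the vision module, and the non-Linear layer are left untouched.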
Why The Hub May Show About 17B Instead Of 27B
Hugging Face derives the displayed parameter count for compressed safetensors repos from the stored tensor payloads, not from the original dense model's logical parameter count.
For this repository, the Hub API reports roughly 16.7B stored elements across U8, F8_E4M3, BF16, and F32 tensors because the NVFP4-compressed language-model weights are packed. The underlying source model is still huihui-ai/Huihui-Qwen3.5-27B-abliterated, and this release is intended as its NVFP4-compressed multimodal variant rather than a separate native 17B architecture.
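The gap between logical and stored element counts can be illustrated with a little arithmetic. The sketch below assumes the usual NVFP4 layout: two 4-bit values packed into each U8 byte, plus one F8_E4M3 scale per block of 16 weights. Per-tensor F32 scales are ignored because they add only one element per tensor. The function name and the input size are illustrative, not values taken from this repository.

```python
def nvfp4_stored_elements(logical_weights: int, block_size: int = 16) -> dict:
    """Estimate stored tensor elements for NVFP4-packed weights.

    Assumes two FP4 values per U8 byte and one F8_E4M3 scale per block.
    """
    packed_u8 = logical_weights // 2                  # two 4-bit weights per byte
    block_scales_f8 = logical_weights // block_size   # one scale per block
    return {
        "U8": packed_u8,
        "F8_E4M3": block_scales_f8,
        "total": packed_u8 + block_scales_f8,
    }


# Illustrative input: one million logical weights.
example = nvfp4_stored_elements(1_000_000)
print(example)                        # {'U8': 500000, 'F8_E4M3': 62500, 'total': 562500}
print(example["total"] / 1_000_000)   # 0.5625
```

Under this assumption each quantized weight contributes about 0.56 stored elements, while every BF16 tensor still contributes one element per weight, so the displayed total falls well below the logical parameter count.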
Inference
As tested locally, this model works with a custom NVFP4-capable vLLM build for RTX 5090 based on the patch repository above. A representative serve command is:
vllm serve lyf/Huihui-Qwen3.5-27B-abliterated-NVFP4 \
--reasoning-parser qwen3 \
--enable-prefix-caching
Depending on your vLLM build, you may also need the same NVFP4 runtime flags used by the reference repository.
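Once the server is running, it can be queried through the OpenAI-compatible chat completions endpoint. The following is a minimal sketch using only the Python standard library. It assumes the server listens on vLLM's default address http://localhost:8000; the helper names build_chat_payload and post_chat and the max_tokens value are illustrative choices, not part of this repository.

```python
import json
import urllib.request

MODEL_ID = "lyf/Huihui-Qwen3.5-27B-abliterated-NVFP4"


def build_chat_payload(text, image_url=None, max_tokens=256):
    """Build a chat-completions request body with optional image input."""
    content = [{"type": "text", "text": text}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def post_chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the server (assumed address) and return parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build and inspect a text-only request without contacting the server.
payload = build_chat_payload("Describe NVFP4 in one sentence.")
print(json.dumps(payload, indent=2))
```

Calling post_chat(payload) sends the request; passing image_url to build_chat_payload exercises the multimodal path that this build is meant to preserve.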
Evaluation
Evaluation numbers are intentionally omitted for now.
The current local vLLM + OpenAI-compatible API + RTX 5090 path is suitable for serving and basic smoke testing, but it does not yet reproduce the reference evaluation setup cleanly enough for publication-quality benchmark numbers:
- the original leaderboard_gpqa_diamond path understated quality because it used a 2047 token limit before the local harness wrapper was fixed
- the current chat-completions path on this RTX 5090 / patched vLLM stack can emit reasoning in a separate field while leaving message.content empty when thinking is enabled, which makes benchmark parity with the reference release unreliable
Benchmarks will be added back once GPQA Diamond, IFEval, and MMLU-Redux have been rerun with a reproducible configuration that matches the intended evaluation protocol.
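The empty message.content behavior described above can be detected on the client side. The sketch below is one possible way to do that, not the harness used for this release. It assumes the reasoning text arrives under a field named reasoning_content or reasoning; both names are assumptions and may differ between vLLM builds.

```python
def extract_answer(response: dict) -> tuple:
    """Return (text, source) from a chat-completions response dict.

    source is "content", "reasoning", or "empty", so a harness can tell
    when it is scoring reasoning text instead of a final answer.
    """
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    # Assumption: the reasoning field name depends on the vLLM build.
    reasoning = (
        message.get("reasoning_content") or message.get("reasoning") or ""
    ).strip()
    if content:
        return content, "content"
    if reasoning:
        return reasoning, "reasoning"
    return "", "empty"


# Hand-written responses illustrating both cases.
normal = {"choices": [{"message": {"content": "Paris"}}]}
thinking_only = {
    "choices": [{"message": {"content": None, "reasoning_content": "The capital is Paris."}}]
}

print(extract_answer(normal))         # ('Paris', 'content')
print(extract_answer(thinking_only))  # ('The capital is Paris.', 'reasoning')
```

Returning the source alongside the text matters for benchmarking: an answer recovered from the reasoning field is not equivalent to a final answer, so such samples should be flagged rather than scored silently.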
Notes
- Architecture: Qwen3_5ForConditionalGeneration
- Pipeline tag: image-text-to-text
- Quantization format: nvfp4-pack-quantized
- Repository layout intentionally includes processor_config.json, preprocessor_config.json, and video_preprocessor_config.json so multimodal preprocessing remains available.
Model tree for lyf/Huihui-Qwen3.5-27B-abliterated-NVFP4
Base model
Qwen/Qwen3.5-27B