Instructions to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF

SGLang

How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with Docker Model Runner:
```
docker model run hf.co/walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF
```

Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF / README.md

walissoncasonatto

Add README

7121c03 verified 19 days ago

preview code

raw

history blame

7.32 kB

	---
	library_name: transformers
	license: apache-2.0
	license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
	pipeline_tag: text-generation
	base_model:
	- huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF
	base_model_relation: quantized
	tags:
	- gguf
	- llama.cpp
	- quantized
	- nvfp4
	- mtp
	- qwen
	- qwen3.6
	- abliterated
	- uncensored
	- blackwell
	- rtx-5090
	language:
	- en
	- zh
	---

	# Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF

	NVFP4 quantization of [huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF) with the MTP (Multi-Token Prediction) head preserved in Q4_K. Targeted at NVIDIA Blackwell consumer/edge GPUs (sm_120/sm_121) such as the RTX 5090.

	The motivation: huihui-ai publishes Huihui abliterated GGUFs at Q2_K through Q8_0 (no NVFP4 variant exists). Standard Q-K quants don't hit Blackwell's native FP4 tensor cores. This conversion follows the recipe documented by [s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF](https://huggingface.co/s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF) — body tensors in NVFP4 (GGML type 40) for tensor-core acceleration, MTP head in Q4_K for draft quality, norms/biases in F32.

	## Why NVFP4 on Blackwell

	NVFP4 is NVIDIA's native 4-bit floating point format for Blackwell. Unlike integer quantization (Q4_K, Q5_K, etc.), NVFP4 uses block floating point with E4M3 scale factors and is dequantized directly by the GPU's tensor cores. The benefits:

	- Hardware-native dequantization — no integer-to-float conversion overhead
	- Lower memory bandwidth — body at ~4.6 BPW vs ~5.5 BPW (Q5_K) or ~6.6 BPW (Q8_0)
	- Acceptable quality — block scaling preserves more information than uniform 4-bit
	- Speed boost — measured ~20-30% over Q5_K_M on RTX 5090 at the same context

	## Source

	Quantized from `huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF` (Q8_0 variant, 29 GB) via `llama-quantize --allow-requantize --tensor-type nvfp4 ... Q4_K`.

	`huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF` itself is an abliterated derivative of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B), with refusal directions zeroed out (see [remove-refusals-with-transformers](https://github.com/Sumandora/remove-refusals-with-transformers)). The MTP head was preserved by huihui-ai in their GGUF release (published post-llama.cpp b9180 which added MTP convert/quantize support).

	## Files

	\| File \| Quant \| Size \| Notes \|
	\|---\|---\|---\|---\|
	\| `Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf` \| NVFP4 body + Q4_K MTP \| ~15-16 GB \| Recommended for RTX 5090 / GB10 / RTX PRO 6000 \|

	## Usage

	### Requirements

	- llama.cpp build with NVFP4 inference enabled (`BLACKWELL_NATIVE_FP4=1` in `system_info`). Mainline b9180+ on a CUDA 13 toolkit + Blackwell GPU has this by default.
	- Build flags used during compile: `-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_FA_ALL_QUANTS=ON`.

	### Server (Windows, copy/paste, adapt paths)

	```powershell
	.\llama-server.exe `
	-m "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf" `
	--spec-type draft-mtp `
	--host 0.0.0.0 --port 5001 `
	-ngl all --kv-unified -np 1 -b 2048 -ub 512 `
	--ctx-size 196608 `
	--cache-type-k q8_0 --cache-type-v q8_0 `
	--flash-attn on --cache-ram 0 --jinja --no-mmap --mlock `
	--reasoning on --reasoning-budget 8192 `
	--metrics `
	--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0
	```

	### Linux / DGX Spark / kneutral

	Same flags, drop the `.exe` and `^` line continuations. On GB10 (DGX Spark) also pass `--no-mmap` due to unified-memory mmap slowdowns.

	## Measured performance

	Benchmarked on personal RTX 5090 (32 GB GDDR7, 1792 GB/s, sm_120a), Windows 11, CUDA 13.2, driver 596.36, llama.cpp mainline `0d18aaa9d1a8af3df9abccd828e22eeaac7f840b` (May 26 2026), MTP `--spec-type draft-mtp` default n_max=3, Q8 KV cache, 196k context.

	### Quality — multi-seed 90/90 PERFECT

	10 independent runs × 9 probes (q1q5 + reasoning) = 90/90 probes pass, 100%, zero retries needed. Matches the multi-seed reference established on kneutral RTX PRO 6000 with the same base model. avgTPS during multi-seed: 93 t/s on q1q5, 95.9 t/s on reasoning suite.

	### Single-pass q1q5 smoke (representative timings)

	5/5 q1q5 smoke PASS with `--reasoning on --reasoning-budget 8192`:

	\| Probe \| Tokens \| TPS \|
	\|---\|---\|---\|
	\| Q1 Tool call (calculator) \| 106 \| 70.4 \|
	\| Q2 Strict JSON extraction \| 347 \| 97.8 \|
	\| Q3 Go `[]rune` UTF-8 reverse \| 1119 \| 95.4 \|
	\| Q4 CRT reasoning (bat-and-ball trap) \| 8366 \| 107.4 \|
	\| Q5 Long-prompt multi-section + FIM marker \| 2636 \| 100.7 \|

	### Throughput sweep (default n_max, `temperature=0.6`)

	\| Workload \| Tokens generated \| MTP acceptance \| TPS \|
	\|---\|---\|---\|---\|
	\| Short code (palindrome function) \| 256 \| 70.9% \| 88.9 \|
	\| Short prose (CAP theorem) \| 256 \| 70.3% \| 92.5 \|
	\| Long-form tech (TCP vs UDP) \| 256 \| 68.4% \| 89.2 \|
	\| Sustained long code (LRU cache class) \| 1024 \| 68.5% \| 91.7 \|

	MTP acceptance is highly consistent (68-71% across all workload categories) — predictable performance regardless of prompt domain. Compare to the same model in Q5_K_M which swings 45-78% acceptance and 71-99 t/s depending on workload.

	### vs other variants on the same hardware

	\| Variant \| File size \| Sustained TPS @ 1024 \| Quality \| Notes \|
	\|---\|---\|---\|---\|---\|
	\| `s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF` (Qwen base, not Huihui) \| 14.64 GB \| 105 t/s peak (different bench) \| 90/90 multi-seed (2 retries) \| Baseline Qwen quality \|
	\| `huihui-ai/Huihui-...-Q5_K.gguf` (this model, Q5_K_M) \| 18.19 GB \| 75-85 t/s \| 5/5 q1q5 \| Highly workload-dependent \|
	\| `Huihui-...-NVFP4-MTP.gguf` (this repo) \| 19.65 GB \| 91-107 t/s \| 5/5 q1q5 \| Best balance: kneutral-validated quality + Blackwell-native speed + predictable acceptance \|

	### VRAM

	\| \| \|
	\|---\|---\|
	\| Model on GPU \| 28.9 GB \|
	\| Free for OS / display / margin \| 3.1 GB \|
	\| Context capacity (q8 KV) \| 196k full + 32 token output budget — paged KV handles mixed sizes up to ~12× concurrency at 16k each \|

	## Limitations

	- Blackwell-only fast path. Will run on older NVIDIA GPUs via emulated dequantization (slow). For Ampere/Ada/older, use the standard quants from [huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF).
	- `-np 1` required for MTP. Multi-token prediction speculative decoding currently requires single-parallel mode in llama.cpp.
	- `--mmproj` incompatible with MTP in mainline llama.cpp. Drop the vision projector if not needed (this is a text-only file regardless).
	- Abliteration tradeoffs. Refusal-direction surgery occasionally affects benign refusals (legal/safety information). Validate against your workload before production.

	## Credits

	- Model author: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba Cloud / Qwen Team)
	- Abliteration + MTP-GGUF source: [huihui-ai](https://huggingface.co/huihui-ai)
	- NVFP4 + GGUF conversion recipe: [s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF](https://huggingface.co/s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF)
	- MTP support in llama.cpp: [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673)

	## License

	Apache 2.0, same as `Qwen/Qwen3.6-27B`. See [LICENSE](LICENSE).