Text Generation
Transformers
GGUF
English
Chinese
llama.cpp
quantized
nvfp4
mtp
qwen
qwen3.6
abliterated
uncensored
blackwell
rtx-5090
Instructions to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF
- SGLang
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with Docker Model Runner:
docker model run hf.co/walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF
File size: 7,315 Bytes
7121c03 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 | ---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF
base_model_relation: quantized
tags:
- gguf
- llama.cpp
- quantized
- nvfp4
- mtp
- qwen
- qwen3.6
- abliterated
- uncensored
- blackwell
- rtx-5090
language:
- en
- zh
---
# Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF
NVFP4 quantization of [huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF) with the MTP (Multi-Token Prediction) head preserved in Q4_K. Targeted at NVIDIA Blackwell consumer/edge GPUs (sm_120/sm_121) such as the RTX 5090.
The motivation: huihui-ai publishes Huihui abliterated GGUFs at Q2_K through Q8_0 (no NVFP4 variant exists). Standard Q-K quants don't hit Blackwell's native FP4 tensor cores. This conversion follows the recipe documented by [s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF](https://huggingface.co/s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF) — body tensors in NVFP4 (GGML type 40) for tensor-core acceleration, MTP head in Q4_K for draft quality, norms/biases in F32.
## Why NVFP4 on Blackwell
NVFP4 is NVIDIA's native 4-bit floating point format for Blackwell. Unlike integer quantization (Q4_K, Q5_K, etc.), NVFP4 uses block floating point with E4M3 scale factors and is dequantized directly by the GPU's tensor cores. The benefits:
- **Hardware-native dequantization** — no integer-to-float conversion overhead
- **Lower memory bandwidth** — body at ~4.6 BPW vs ~5.5 BPW (Q5_K) or ~6.6 BPW (Q8_0)
- **Acceptable quality** — block scaling preserves more information than uniform 4-bit
- **Speed boost** — measured ~20-30% over Q5_K_M on RTX 5090 at the same context
## Source
Quantized from `huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF` (Q8_0 variant, 29 GB) via `llama-quantize --allow-requantize --tensor-type nvfp4 ... Q4_K`.
`huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF` itself is an abliterated derivative of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B), with refusal directions zeroed out (see [remove-refusals-with-transformers](https://github.com/Sumandora/remove-refusals-with-transformers)). The MTP head was preserved by huihui-ai in their GGUF release (published post-llama.cpp b9180 which added MTP convert/quantize support).
## Files
| File | Quant | Size | Notes |
|---|---|---|---|
| `Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf` | NVFP4 body + Q4_K MTP | ~15-16 GB | Recommended for RTX 5090 / GB10 / RTX PRO 6000 |
## Usage
### Requirements
- llama.cpp build with NVFP4 inference enabled (`BLACKWELL_NATIVE_FP4=1` in `system_info`). Mainline b9180+ on a CUDA 13 toolkit + Blackwell GPU has this by default.
- Build flags used during compile: `-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_FA_ALL_QUANTS=ON`.
### Server (Windows, copy/paste, adapt paths)
```powershell
.\llama-server.exe `
-m "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf" `
--spec-type draft-mtp `
--host 0.0.0.0 --port 5001 `
-ngl all --kv-unified -np 1 -b 2048 -ub 512 `
--ctx-size 196608 `
--cache-type-k q8_0 --cache-type-v q8_0 `
--flash-attn on --cache-ram 0 --jinja --no-mmap --mlock `
--reasoning on --reasoning-budget 8192 `
--metrics `
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0
```
### Linux / DGX Spark / kneutral
Same flags, drop the `.exe` and `^` line continuations. On GB10 (DGX Spark) also pass `--no-mmap` due to unified-memory mmap slowdowns.
## Measured performance
Benchmarked on personal RTX 5090 (32 GB GDDR7, 1792 GB/s, sm_120a), Windows 11, CUDA 13.2, driver 596.36, llama.cpp mainline `0d18aaa9d1a8af3df9abccd828e22eeaac7f840b` (May 26 2026), MTP `--spec-type draft-mtp` default n_max=3, Q8 KV cache, 196k context.
### Quality — multi-seed 90/90 PERFECT
10 independent runs × 9 probes (q1q5 + reasoning) = **90/90 probes pass, 100%**, zero retries needed. Matches the multi-seed reference established on kneutral RTX PRO 6000 with the same base model. avgTPS during multi-seed: **93 t/s on q1q5, 95.9 t/s on reasoning suite**.
### Single-pass q1q5 smoke (representative timings)
**5/5 q1q5 smoke PASS** with `--reasoning on --reasoning-budget 8192`:
| Probe | Tokens | TPS |
|---|---|---|
| Q1 Tool call (calculator) | 106 | 70.4 |
| Q2 Strict JSON extraction | 347 | 97.8 |
| Q3 Go `[]rune` UTF-8 reverse | 1119 | 95.4 |
| **Q4 CRT reasoning (bat-and-ball trap)** | **8366** | **107.4** |
| Q5 Long-prompt multi-section + FIM marker | 2636 | 100.7 |
### Throughput sweep (default n_max, `temperature=0.6`)
| Workload | Tokens generated | MTP acceptance | TPS |
|---|---|---|---|
| Short code (palindrome function) | 256 | 70.9% | 88.9 |
| Short prose (CAP theorem) | 256 | 70.3% | 92.5 |
| Long-form tech (TCP vs UDP) | 256 | 68.4% | 89.2 |
| **Sustained long code (LRU cache class)** | **1024** | **68.5%** | **91.7** |
MTP acceptance is **highly consistent (68-71% across all workload categories)** — predictable performance regardless of prompt domain. Compare to the same model in Q5_K_M which swings 45-78% acceptance and 71-99 t/s depending on workload.
### vs other variants on the same hardware
| Variant | File size | Sustained TPS @ 1024 | Quality | Notes |
|---|---|---|---|---|
| `s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF` (Qwen base, not Huihui) | 14.64 GB | 105 t/s peak (different bench) | 90/90 multi-seed (2 retries) | Baseline Qwen quality |
| `huihui-ai/Huihui-...-Q5_K.gguf` (this model, Q5_K_M) | 18.19 GB | 75-85 t/s | 5/5 q1q5 | Highly workload-dependent |
| **`Huihui-...-NVFP4-MTP.gguf` (this repo)** | **19.65 GB** | **91-107 t/s** | **5/5 q1q5** | **Best balance: kneutral-validated quality + Blackwell-native speed + predictable acceptance** |
### VRAM
| | |
|---|---|
| Model on GPU | 28.9 GB |
| Free for OS / display / margin | 3.1 GB |
| Context capacity (q8 KV) | 196k full + 32 token output budget — paged KV handles mixed sizes up to ~12× concurrency at 16k each |
## Limitations
- **Blackwell-only fast path.** Will run on older NVIDIA GPUs via emulated dequantization (slow). For Ampere/Ada/older, use the standard quants from [huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF).
- **`-np 1` required for MTP.** Multi-token prediction speculative decoding currently requires single-parallel mode in llama.cpp.
- **`--mmproj` incompatible with MTP** in mainline llama.cpp. Drop the vision projector if not needed (this is a text-only file regardless).
- **Abliteration tradeoffs.** Refusal-direction surgery occasionally affects benign refusals (legal/safety information). Validate against your workload before production.
## Credits
- **Model author:** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba Cloud / Qwen Team)
- **Abliteration + MTP-GGUF source:** [huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 + GGUF conversion recipe:** [s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF](https://huggingface.co/s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF)
- **MTP support in llama.cpp:** [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673)
## License
Apache 2.0, same as `Qwen/Qwen3.6-27B`. See [LICENSE](LICENSE).
|