Text Generation
Transformers
GGUF
English
Chinese
llama.cpp
quantized
nvfp4
mtp
qwen
qwen3.6
abliterated
uncensored
blackwell
rtx-5090
Instructions to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF
- SGLang
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF with Docker Model Runner:
docker model run hf.co/walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF
| library_name: transformers | |
| license: apache-2.0 | |
| license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE | |
| pipeline_tag: text-generation | |
| base_model: | |
| - huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF | |
| base_model_relation: quantized | |
| tags: | |
| - gguf | |
| - llama.cpp | |
| - quantized | |
| - nvfp4 | |
| - mtp | |
| - qwen | |
| - qwen3.6 | |
| - abliterated | |
| - uncensored | |
| - blackwell | |
| - rtx-5090 | |
| language: | |
| - en | |
| - zh | |
| # Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF | |
| NVFP4 quantization of [huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF) with the MTP (Multi-Token Prediction) head preserved in Q4_K. Targeted at NVIDIA Blackwell consumer/edge GPUs (sm_120/sm_121) such as the RTX 5090. | |
| The motivation: huihui-ai publishes Huihui abliterated GGUFs at Q2_K through Q8_0 (no NVFP4 variant exists). Standard Q-K quants don't hit Blackwell's native FP4 tensor cores. This conversion follows the recipe documented by [s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF](https://huggingface.co/s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF) — body tensors in NVFP4 (GGML type 40) for tensor-core acceleration, MTP head in Q4_K for draft quality, norms/biases in F32. | |
| ## Why NVFP4 on Blackwell | |
| NVFP4 is NVIDIA's native 4-bit floating point format for Blackwell. Unlike integer quantization (Q4_K, Q5_K, etc.), NVFP4 uses block floating point with E4M3 scale factors and is dequantized directly by the GPU's tensor cores. The benefits: | |
| - **Hardware-native dequantization** — no integer-to-float conversion overhead | |
| - **Lower memory bandwidth** — body at ~4.6 BPW vs ~5.5 BPW (Q5_K) or ~6.6 BPW (Q8_0) | |
| - **Acceptable quality** — block scaling preserves more information than uniform 4-bit | |
| - **Speed boost** — measured ~20-30% over Q5_K_M on RTX 5090 at the same context | |
| ## Source | |
| Quantized from `huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF` (Q8_0 variant, 29 GB) via `llama-quantize --allow-requantize --tensor-type nvfp4 ... Q4_K`. | |
| `huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF` itself is an abliterated derivative of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B), with refusal directions zeroed out (see [remove-refusals-with-transformers](https://github.com/Sumandora/remove-refusals-with-transformers)). The MTP head was preserved by huihui-ai in their GGUF release (published post-llama.cpp b9180 which added MTP convert/quantize support). | |
| ## Files | |
| | File | Quant | Size | Notes | | |
| |---|---|---|---| | |
| | `Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf` | NVFP4 body + Q4_K MTP | ~15-16 GB | Recommended for RTX 5090 / GB10 / RTX PRO 6000 | | |
| ## Usage | |
| ### Requirements | |
| - llama.cpp build with NVFP4 inference enabled (`BLACKWELL_NATIVE_FP4=1` in `system_info`). Mainline b9180+ on a CUDA 13 toolkit + Blackwell GPU has this by default. | |
| - Build flags used during compile: `-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_FA_ALL_QUANTS=ON`. | |
| ### Server (Windows, copy/paste, adapt paths) | |
| ```powershell | |
| .\llama-server.exe ` | |
| -m "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf" ` | |
| --spec-type draft-mtp ` | |
| --host 0.0.0.0 --port 5001 ` | |
| -ngl all --kv-unified -np 1 -b 2048 -ub 512 ` | |
| --ctx-size 196608 ` | |
| --cache-type-k q8_0 --cache-type-v q8_0 ` | |
| --flash-attn on --cache-ram 0 --jinja --no-mmap --mlock ` | |
| --reasoning on --reasoning-budget 8192 ` | |
| --metrics ` | |
| --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 | |
| ``` | |
| ### Linux / DGX Spark / kneutral | |
| Same flags, drop the `.exe` and `^` line continuations. On GB10 (DGX Spark) also pass `--no-mmap` due to unified-memory mmap slowdowns. | |
| ## Measured performance | |
| Benchmarked on personal RTX 5090 (32 GB GDDR7, 1792 GB/s, sm_120a), Windows 11, CUDA 13.2, driver 596.36, llama.cpp mainline `0d18aaa9d1a8af3df9abccd828e22eeaac7f840b` (May 26 2026), MTP `--spec-type draft-mtp` default n_max=3, Q8 KV cache, 196k context. | |
| ### Quality — multi-seed 90/90 PERFECT | |
| 10 independent runs × 9 probes (q1q5 + reasoning) = **90/90 probes pass, 100%**, zero retries needed. Matches the multi-seed reference established on kneutral RTX PRO 6000 with the same base model. avgTPS during multi-seed: **93 t/s on q1q5, 95.9 t/s on reasoning suite**. | |
| ### Single-pass q1q5 smoke (representative timings) | |
| **5/5 q1q5 smoke PASS** with `--reasoning on --reasoning-budget 8192`: | |
| | Probe | Tokens | TPS | | |
| |---|---|---| | |
| | Q1 Tool call (calculator) | 106 | 70.4 | | |
| | Q2 Strict JSON extraction | 347 | 97.8 | | |
| | Q3 Go `[]rune` UTF-8 reverse | 1119 | 95.4 | | |
| | **Q4 CRT reasoning (bat-and-ball trap)** | **8366** | **107.4** | | |
| | Q5 Long-prompt multi-section + FIM marker | 2636 | 100.7 | | |
| ### Throughput sweep (default n_max, `temperature=0.6`) | |
| | Workload | Tokens generated | MTP acceptance | TPS | | |
| |---|---|---|---| | |
| | Short code (palindrome function) | 256 | 70.9% | 88.9 | | |
| | Short prose (CAP theorem) | 256 | 70.3% | 92.5 | | |
| | Long-form tech (TCP vs UDP) | 256 | 68.4% | 89.2 | | |
| | **Sustained long code (LRU cache class)** | **1024** | **68.5%** | **91.7** | | |
| MTP acceptance is **highly consistent (68-71% across all workload categories)** — predictable performance regardless of prompt domain. Compare to the same model in Q5_K_M which swings 45-78% acceptance and 71-99 t/s depending on workload. | |
| ### vs other variants on the same hardware | |
| | Variant | File size | Sustained TPS @ 1024 | Quality | Notes | | |
| |---|---|---|---|---| | |
| | `s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF` (Qwen base, not Huihui) | 14.64 GB | 105 t/s peak (different bench) | 90/90 multi-seed (2 retries) | Baseline Qwen quality | | |
| | `huihui-ai/Huihui-...-Q5_K.gguf` (this model, Q5_K_M) | 18.19 GB | 75-85 t/s | 5/5 q1q5 | Highly workload-dependent | | |
| | **`Huihui-...-NVFP4-MTP.gguf` (this repo)** | **19.65 GB** | **91-107 t/s** | **5/5 q1q5** | **Best balance: kneutral-validated quality + Blackwell-native speed + predictable acceptance** | | |
| ### VRAM | |
| | | | | |
| |---|---| | |
| | Model on GPU | 28.9 GB | | |
| | Free for OS / display / margin | 3.1 GB | | |
| | Context capacity (q8 KV) | 196k full + 32 token output budget — paged KV handles mixed sizes up to ~12× concurrency at 16k each | | |
| ## Limitations | |
| - **Blackwell-only fast path.** Will run on older NVIDIA GPUs via emulated dequantization (slow). For Ampere/Ada/older, use the standard quants from [huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF). | |
| - **`-np 1` required for MTP.** Multi-token prediction speculative decoding currently requires single-parallel mode in llama.cpp. | |
| - **`--mmproj` incompatible with MTP** in mainline llama.cpp. Drop the vision projector if not needed (this is a text-only file regardless). | |
| - **Abliteration tradeoffs.** Refusal-direction surgery occasionally affects benign refusals (legal/safety information). Validate against your workload before production. | |
| ## Credits | |
| - **Model author:** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba Cloud / Qwen Team) | |
| - **Abliteration + MTP-GGUF source:** [huihui-ai](https://huggingface.co/huihui-ai) | |
| - **NVFP4 + GGUF conversion recipe:** [s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF](https://huggingface.co/s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF) | |
| - **MTP support in llama.cpp:** [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673) | |
| ## License | |
| Apache 2.0, same as `Qwen/Qwen3.6-27B`. See [LICENSE](LICENSE). | |