File size: 7,315 Bytes
7121c03
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
  - huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF
base_model_relation: quantized
tags:
  - gguf
  - llama.cpp
  - quantized
  - nvfp4
  - mtp
  - qwen
  - qwen3.6
  - abliterated
  - uncensored
  - blackwell
  - rtx-5090
language:
  - en
  - zh
---

# Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF

NVFP4 quantization of [huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF) with the MTP (Multi-Token Prediction) head preserved in Q4_K. Targeted at NVIDIA Blackwell consumer/edge GPUs (sm_120/sm_121) such as the RTX 5090.

The motivation: huihui-ai publishes Huihui abliterated GGUFs at Q2_K through Q8_0 (no NVFP4 variant exists). Standard Q-K quants don't hit Blackwell's native FP4 tensor cores. This conversion follows the recipe documented by [s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF](https://huggingface.co/s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF) — body tensors in NVFP4 (GGML type 40) for tensor-core acceleration, MTP head in Q4_K for draft quality, norms/biases in F32.

## Why NVFP4 on Blackwell

NVFP4 is NVIDIA's native 4-bit floating point format for Blackwell. Unlike integer quantization (Q4_K, Q5_K, etc.), NVFP4 uses block floating point with E4M3 scale factors and is dequantized directly by the GPU's tensor cores. The benefits:

- **Hardware-native dequantization** — no integer-to-float conversion overhead
- **Lower memory bandwidth** — body at ~4.6 BPW vs ~5.5 BPW (Q5_K) or ~6.6 BPW (Q8_0)
- **Acceptable quality** — block scaling preserves more information than uniform 4-bit
- **Speed boost** — measured ~20-30% over Q5_K_M on RTX 5090 at the same context

## Source

Quantized from `huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF` (Q8_0 variant, 29 GB) via `llama-quantize --allow-requantize --tensor-type nvfp4 ... Q4_K`.

`huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF` itself is an abliterated derivative of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B), with refusal directions zeroed out (see [remove-refusals-with-transformers](https://github.com/Sumandora/remove-refusals-with-transformers)). The MTP head was preserved by huihui-ai in their GGUF release (published post-llama.cpp b9180 which added MTP convert/quantize support).

## Files

| File | Quant | Size | Notes |
|---|---|---|---|
| `Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf` | NVFP4 body + Q4_K MTP | ~15-16 GB | Recommended for RTX 5090 / GB10 / RTX PRO 6000 |

## Usage

### Requirements

- llama.cpp build with NVFP4 inference enabled (`BLACKWELL_NATIVE_FP4=1` in `system_info`). Mainline b9180+ on a CUDA 13 toolkit + Blackwell GPU has this by default.
- Build flags used during compile: `-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_FA_ALL_QUANTS=ON`.

### Server (Windows, copy/paste, adapt paths)

```powershell
.\llama-server.exe `
  -m "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf" `
  --spec-type draft-mtp `
  --host 0.0.0.0 --port 5001 `
  -ngl all --kv-unified -np 1 -b 2048 -ub 512 `
  --ctx-size 196608 `
  --cache-type-k q8_0 --cache-type-v q8_0 `
  --flash-attn on --cache-ram 0 --jinja --no-mmap --mlock `
  --reasoning on --reasoning-budget 8192 `
  --metrics `
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0
```

### Linux / DGX Spark / kneutral

Same flags, drop the `.exe` and `^` line continuations. On GB10 (DGX Spark) also pass `--no-mmap` due to unified-memory mmap slowdowns.

## Measured performance

Benchmarked on personal RTX 5090 (32 GB GDDR7, 1792 GB/s, sm_120a), Windows 11, CUDA 13.2, driver 596.36, llama.cpp mainline `0d18aaa9d1a8af3df9abccd828e22eeaac7f840b` (May 26 2026), MTP `--spec-type draft-mtp` default n_max=3, Q8 KV cache, 196k context.

### Quality — multi-seed 90/90 PERFECT

10 independent runs × 9 probes (q1q5 + reasoning) = **90/90 probes pass, 100%**, zero retries needed. Matches the multi-seed reference established on kneutral RTX PRO 6000 with the same base model. avgTPS during multi-seed: **93 t/s on q1q5, 95.9 t/s on reasoning suite**.

### Single-pass q1q5 smoke (representative timings)

**5/5 q1q5 smoke PASS** with `--reasoning on --reasoning-budget 8192`:

| Probe | Tokens | TPS |
|---|---|---|
| Q1 Tool call (calculator) | 106 | 70.4 |
| Q2 Strict JSON extraction | 347 | 97.8 |
| Q3 Go `[]rune` UTF-8 reverse | 1119 | 95.4 |
| **Q4 CRT reasoning (bat-and-ball trap)** | **8366** | **107.4** |
| Q5 Long-prompt multi-section + FIM marker | 2636 | 100.7 |

### Throughput sweep (default n_max, `temperature=0.6`)

| Workload | Tokens generated | MTP acceptance | TPS |
|---|---|---|---|
| Short code (palindrome function) | 256 | 70.9% | 88.9 |
| Short prose (CAP theorem) | 256 | 70.3% | 92.5 |
| Long-form tech (TCP vs UDP) | 256 | 68.4% | 89.2 |
| **Sustained long code (LRU cache class)** | **1024** | **68.5%** | **91.7** |

MTP acceptance is **highly consistent (68-71% across all workload categories)** — predictable performance regardless of prompt domain. Compare to the same model in Q5_K_M which swings 45-78% acceptance and 71-99 t/s depending on workload.

### vs other variants on the same hardware

| Variant | File size | Sustained TPS @ 1024 | Quality | Notes |
|---|---|---|---|---|
| `s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF` (Qwen base, not Huihui) | 14.64 GB | 105 t/s peak (different bench) | 90/90 multi-seed (2 retries) | Baseline Qwen quality |
| `huihui-ai/Huihui-...-Q5_K.gguf` (this model, Q5_K_M) | 18.19 GB | 75-85 t/s | 5/5 q1q5 | Highly workload-dependent |
| **`Huihui-...-NVFP4-MTP.gguf` (this repo)** | **19.65 GB** | **91-107 t/s** | **5/5 q1q5** | **Best balance: kneutral-validated quality + Blackwell-native speed + predictable acceptance** |

### VRAM

| | |
|---|---|
| Model on GPU | 28.9 GB |
| Free for OS / display / margin | 3.1 GB |
| Context capacity (q8 KV) | 196k full + 32 token output budget — paged KV handles mixed sizes up to ~12× concurrency at 16k each |

## Limitations

- **Blackwell-only fast path.** Will run on older NVIDIA GPUs via emulated dequantization (slow). For Ampere/Ada/older, use the standard quants from [huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF).
- **`-np 1` required for MTP.** Multi-token prediction speculative decoding currently requires single-parallel mode in llama.cpp.
- **`--mmproj` incompatible with MTP** in mainline llama.cpp. Drop the vision projector if not needed (this is a text-only file regardless).
- **Abliteration tradeoffs.** Refusal-direction surgery occasionally affects benign refusals (legal/safety information). Validate against your workload before production.

## Credits

- **Model author:** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) (Alibaba Cloud / Qwen Team)
- **Abliteration + MTP-GGUF source:** [huihui-ai](https://huggingface.co/huihui-ai)
- **NVFP4 + GGUF conversion recipe:** [s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF](https://huggingface.co/s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF)
- **MTP support in llama.cpp:** [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673)

## License

Apache 2.0, same as `Qwen/Qwen3.6-27B`. See [LICENSE](LICENSE).