Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF

NVFP4 quantization of huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF with the MTP (Multi-Token Prediction) head preserved in Q4_K. Targeted at NVIDIA Blackwell consumer/edge GPUs (sm_120/sm_121) such as the RTX 5090.

The motivation: huihui-ai publishes Huihui abliterated GGUFs at Q2_K through Q8_0 (no NVFP4 variant exists). Standard Q-K quants don't hit Blackwell's native FP4 tensor cores. This conversion follows the recipe documented by s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF โ€” body tensors in NVFP4 (GGML type 40) for tensor-core acceleration, MTP head in Q4_K for draft quality, norms/biases in F32.

Why NVFP4 on Blackwell

NVFP4 is NVIDIA's native 4-bit floating point format for Blackwell. Unlike integer quantization (Q4_K, Q5_K, etc.), NVFP4 uses block floating point with E4M3 scale factors and is dequantized directly by the GPU's tensor cores. The benefits:

  • Hardware-native dequantization โ€” no integer-to-float conversion overhead
  • Lower memory bandwidth โ€” body at ~4.6 BPW vs ~5.5 BPW (Q5_K) or ~6.6 BPW (Q8_0)
  • Acceptable quality โ€” block scaling preserves more information than uniform 4-bit
  • Speed boost โ€” measured ~20-30% over Q5_K_M on RTX 5090 at the same context

Source

Quantized from huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF (Q8_0 variant, 29 GB) via llama-quantize --allow-requantize --tensor-type nvfp4 ... Q4_K.

huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF itself is an abliterated derivative of Qwen/Qwen3.6-27B, with refusal directions zeroed out (see remove-refusals-with-transformers). The MTP head was preserved by huihui-ai in their GGUF release (published post-llama.cpp b9180 which added MTP convert/quantize support).

Files

File Quant Size Notes
Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf NVFP4 body + Q4_K MTP ~15-16 GB Recommended for RTX 5090 / GB10 / RTX PRO 6000

Usage

Requirements

  • llama.cpp build with NVFP4 inference enabled (BLACKWELL_NATIVE_FP4=1 in system_info). Mainline b9180+ on a CUDA 13 toolkit + Blackwell GPU has this by default.
  • Build flags used during compile: -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_FA_ALL_QUANTS=ON.

Server (Windows, copy/paste, adapt paths)

.\llama-server.exe `
  -m "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP.gguf" `
  --spec-type draft-mtp `
  --host 0.0.0.0 --port 5001 `
  -ngl all --kv-unified -np 1 -b 2048 -ub 512 `
  --ctx-size 196608 `
  --cache-type-k q8_0 --cache-type-v q8_0 `
  --flash-attn on --cache-ram 0 --jinja --no-mmap --mlock `
  --reasoning on --reasoning-budget 8192 `
  --metrics `
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0

Linux / DGX Spark

Same flags, drop the .exe and ^ line continuations. On GB10 (DGX Spark) also pass --no-mmap due to unified-memory mmap slowdowns.

Measured performance

Benchmarked on personal RTX 5090 (32 GB GDDR7, 1792 GB/s, sm_120a), Windows 11, CUDA 13.2, driver 596.36, llama.cpp mainline 0d18aaa9d1a8af3df9abccd828e22eeaac7f840b (May 26 2026), MTP --spec-type draft-mtp default n_max=3, Q8 KV cache, 196k context.

Quality โ€” multi-seed 90/90 PERFECT

10 independent runs ร— 9 probes (q1q5 + reasoning) = 90/90 probes pass, 100%, zero retries needed. avgTPS during multi-seed: 93 t/s on q1q5, 95.9 t/s on reasoning suite.

Single-pass q1q5 smoke (representative timings)

5/5 q1q5 smoke PASS with --reasoning on --reasoning-budget 8192:

Probe Tokens TPS
Q1 Tool call (calculator) 106 70.4
Q2 Strict JSON extraction 347 97.8
Q3 Go []rune UTF-8 reverse 1119 95.4
Q4 CRT reasoning (bat-and-ball trap) 8366 107.4
Q5 Long-prompt multi-section + FIM marker 2636 100.7

Throughput sweep (default n_max, temperature=0.6)

Workload Tokens generated MTP acceptance TPS
Short code (palindrome function) 256 70.9% 88.9
Short prose (CAP theorem) 256 70.3% 92.5
Long-form tech (TCP vs UDP) 256 68.4% 89.2
Sustained long code (LRU cache class) 1024 68.5% 91.7

MTP acceptance is highly consistent (68-71% across all workload categories) โ€” predictable performance regardless of prompt domain. Compare to the same model in Q5_K_M which swings 45-78% acceptance and 71-99 t/s depending on workload.

vs other variants on the same hardware

Variant File size Sustained TPS @ 1024 Quality Notes
s-batman/Qwen3.6-27B-NVFP4-MTP-GGUF (Qwen base, not Huihui) 14.64 GB 105 t/s peak (different bench) 90/90 multi-seed (2 retries) Baseline Qwen quality
huihui-ai/Huihui-...-Q5_K.gguf (this model, Q5_K_M) 18.19 GB 75-85 t/s 5/5 q1q5 Highly workload-dependent
Huihui-...-NVFP4-MTP.gguf (this repo) 19.65 GB 91-107 t/s 5/5 q1q5 Best balance: abliterated quality + Blackwell-native speed + predictable acceptance

VRAM

Model on GPU 28.9 GB
Free for OS / display / margin 3.1 GB
Context capacity (q8 KV) 196k full + 32 token output budget โ€” paged KV handles mixed sizes up to ~12ร— concurrency at 16k each

Limitations

  • Blackwell-only fast path. Will run on older NVIDIA GPUs via emulated dequantization (slow). For Ampere/Ada/older, use the standard quants from huihui-ai/Huihui-Qwen3.6-27B-abliterated-MTP-GGUF.
  • -np 1 required for MTP. Multi-token prediction speculative decoding currently requires single-parallel mode in llama.cpp.
  • --mmproj incompatible with MTP in mainline llama.cpp. Drop the vision projector if not needed (this is a text-only file regardless).
  • Abliteration tradeoffs. Refusal-direction surgery occasionally affects benign refusals (legal/safety information). Validate against your workload before production.

Credits

License

Apache 2.0, same as Qwen/Qwen3.6-27B. See LICENSE.

Downloads last month
1,480
GGUF
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for walissoncasonatto/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(1)
this model