---
license: apache-2.0
base_model: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16
language:
- en
- zh
- multilingual
library_name: transformers
pipeline_tag: text-generation
tags:
- abliterated
- uncensored
- qwen3
- qwen3.6
- nvfp4
- modelopt
- mtp
- multi-token-prediction
- speculative-decoding
- hybrid-attention
- mamba
- gated-deltanet
- multimodal
- aeon
- rtx-5090
- rtx-pro-6000
- b100
- b200
- dedicated-vram-blackwell
- sm_120
- sm_100
- 32gb
- conv1d-preserved
---

# Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS

> **Deployment, operations & benchmarks → [github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash](https://github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash)**
>
> The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured benchmarks, and `AGENTS.md` — an operator's manual that pre-empts common stale-documentation traps.

> ## 🏆 DGX Spark performance — current production *(v3 image, 2026-04-29)*
>
> Served with **DFlash spec decode** *(not the MTP head)* on this XS body, the v3 image (`ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3`) clocks **38.5 tok/s median, 71.3 tok/s peak** thinking-on / **38.1 / 68.4** thinking-off — a **+18 % median / +26 % peak** lift over the prior v2.1 image and a **+17 % / +21 %** stacked lift vs the original `-NVFP4` (compressed-tensors) production. Median TTFT is **247 ms** (was 325 ms — −24 %). See the [GitHub Performance section](https://github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash#performance) for the four-config comparison table.

> **🙏 Reference recipe credit:** The conv1d-preserved NVFP4 + MTP graft pipeline used to build this XS variant is based on [**sakamakismile**](https://huggingface.co/sakamakismile)'s validated [Qwen3.6-27B-NVFP4-MTP series](https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP) (22K+ downloads). They worked out the modelopt config — including the strategic decision to quantize the GDN projection matmuls to NVFP4 while preserving `linear_attn.conv1d` at BF16 — and the MTP-head graft technique. We adapted the recipe to AEON-Ultimate's abliterated weights and ship both the conv1d-preserved-only XS variant (matching their footprint) and a heavier regular-MTP variant that additionally keeps the projections at BF16. Full credit for the underlying recipe → sakamakismile.

## What "XS" means — and what it's *not*

This is the **extra-small footprint** sibling of [`-Multimodal-NVFP4-MTP`](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP). XS is **not "everything to FP4."** It is a deliberate, principled split: the heavy GDN matmul projections drop to NVFP4 (where they're bandwidth-bound and FP4 wins big), while the SSM-critical `linear_attn.conv1d` kernel **stays BF16** (where FP4 has documented stability problems on long-context recurrence).

| | **Multimodal-NVFP4-MTP** (regular) | **Multimodal-NVFP4-MTP-XS** *(this repo)* |
|---|---|---|
| `linear_attn` projections (`in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj`) | preserved BF16 (~11 GB) | quantized to NVFP4 (~3 GB) |
| **`linear_attn.conv1d`** *(SSM 1D convolution — recurrence-critical)* | **preserved BF16** | **preserved BF16** ✅ |
| `linear_attn` SSM state vectors (`A_log`, `dt_bias`, `norm.weight`) | preserved BF16 | preserved BF16 ✅ |
| `mtp.*` head *(grafted bf16 from base, bit-exact verified)* | yes | yes |
| Vision tower | preserved BF16 | preserved BF16 |
| **Total disk** | **~27 GB** | **~21 GB** |
| **VRAM footprint at runtime** | ~28 GB | ~22 GB |

**This is a smart, strategic quantization — not a precision compromise.** The conv1d preservation matters: the GatedDeltaNet recurrence depends on the 1D convolution behaving numerically like its training distribution, and FP4 quantization of `conv1d` has been observed to cause drift on long-context inference in community testing. By keeping conv1d BF16 while quantizing the projections (which are bandwidth-limited matmuls where FP4 is a clean win), we get the ~6 GB footprint reduction without sacrificing the part of the model that's actually fragile under quantization. This is the same principle modelopt's `NVFP4_DEFAULT_CFG` applies by default and the same recipe sakamakismile validated across his Qwen3.6-NVFP4-MTP series (22K+ downloads).

**When to pick which:**
- **Pick the regular variant** if you have ≥48 GB VRAM. Even the *projection* weights at BF16 give a small additional safety margin on long-context recurrence stability.
- **Pick this XS variant** if you have **24–32 GB VRAM** (RTX 5090, single GPUs without headroom for full BF16 GDN). The conv1d preservation guarantees the SSM recurrence stays numerically stable; the ~6 GB savings buy meaningful KV-cache headroom on tight GPUs.

We ship both because we have the headroom on RTX PRO 6000 / B100/B200 to run the larger, more numerically-conservative version, and several users on tighter cards have asked for the smaller one. **Neither variant** quantizes `linear_attn.conv1d` — that would be a different (and not-recommended) variant we have explicitly chosen not to ship.

## Variants

| Format | Size | Use case |
|---|---|---|
| [BF16](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16) | 51 GB | Full-precision reference weights |
| [NVFP4 (compressed-tensors + DFlash)](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4) | 26 GB | DGX Spark — DFlash spec decode, validated |
| [Multimodal-NVFP4-MTP](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP) | 27 GB | RTX PRO 6000 / B100/B200 — MTP, GDN preserved BF16 |
| [Text-NVFP4-MTP](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP) | 26 GB | Same as above without vision tower |
| **Multimodal-NVFP4-MTP-XS** *(this repo)* | **21 GB** | RTX 5090 / smaller dedicated VRAM — MTP, full FP4 incl. GDN projections |
| [Text-NVFP4-MTP-XS](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS) | 20 GB | Same as this repo without vision tower |

## What this is

The **modelopt-format NVFP4 + MTP variant, multimodal-preserved, with `linear_attn` projections fully quantized**, of [AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16) — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

- **Body quantized to NVFP4** via `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`. modelopt format, served by vLLM through `--quantization modelopt`.
- **Linear-attn / GatedDeltaNet projections quantized to NVFP4** (this is the XS difference). Only `linear_attn.conv1d` is kept BF16 (modelopt's default). The community has validated this approach on Qwen3.5/3.6-NVFP4 builds with 22K+ downloads on sakamakismile's reference recipes; we re-ran calibration on our abliterated weights and the model serves correctly.
- **Vision tower preserved BF16** (333 keys). Multimodal inference fully functional.
- **MTP head grafted from the base** `Qwen/Qwen3.6-27B` checkpoint (15 tensors, BF16, bit-exact verified). Powers `--speculative-config '{"method":"qwen3_5_mtp",...}'` for self-speculative decoding without a separate drafter.

## Why MTP

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained `mtp.*` head, enabling **speculative decoding without a separate drafter model**. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Indicative published numbers (sakamakismile's reference recipe on RTX 5090):

- Single-stream short prompts at `n=3`: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189-207 tok/s
- Mean acceptance length: ~3.0-4.0 (compared to DFlash chains of ~2.0-2.3)

Validated benchmarks of the AEON-Ultimate XS variant land in the [GitHub repo](https://github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash#performance) once measured.

## 🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on **memory architecture**:

| Hardware tier | Recommended variant | Why |
|---|---|---|
| **DGX Spark / GB10** *(sm_121a, unified memory)* | Either: **[`-NVFP4` (DFlash)](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-NVFP4)** *(simpler, validated)* **or this XS body served with `--speculative-config '{"method":"dflash",...}'`** *(highest measured throughput — see note below)* | Spark prefers DFlash regardless of body. The XS body **with DFlash spec** lands at **37.6 tok/s median, 68.7 tok/s peak** on Spark — the highest measured config. The grafted MTP head in this repo is *unused* in that path. **Never use `--speculative-config '{"method":"qwen3_5_mtp",...}'` on Spark** — that lands at only 24.1 tok/s median. |
| **RTX PRO 6000 Blackwell** *(96 GB dedicated VRAM)* | [Multimodal-NVFP4-MTP](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP) — GDN BF16 for best long-context fidelity, *or* **this XS variant** for ~10 % faster decode | XS measured 111.4 tok/s median vs regular's 101.5 on RTX PRO 6000. Both win against DFlash on dedicated VRAM. |
| **B100 / B200** *(sm_100, dedicated FP4)* | [Multimodal-NVFP4-MTP](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP) (preferred — GDN BF16 fits) or this XS | Native FP4 + dedicated VRAM = MTP territory. Whichever fits cleanly. |
| **RTX 5090** *(sm_120, 32 GB dedicated VRAM)* | **This XS variant** ✅ if you use vision; [Text-XS](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Text-NVFP4-MTP-XS) if text-only | XS variants fit comfortably in 32 GB; matches sakamakismile's reference footprint. |
| **A100 / H100** *(no native FP4)* | [BF16](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16) | NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit. |

Full bench numbers: [GitHub repo Performance section](https://github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash#performance).
| **A100 / H100** (no native FP4) | [BF16](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16) |

## Usage

### vLLM serve — dedicated-VRAM Blackwell (default: MTP via grafted head)

```bash
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp-xs

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.94 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
```

`num_speculative_tokens=3` is the canonical setting for `qwen3_5_mtp`. Higher values diverge the drafter further from the target distribution and acceptance falls.

### vLLM serve — DGX Spark (DFlash spec, *not* MTP — measured winning config)

For DGX Spark, swap the spec method to DFlash. The XS body still benefits from FP4 silicon, but DFlash's k=15 chains are decisively better than MTP's n=3 on unified memory.

```bash
# Pull the DFlash drafter alongside this body
hf download z-lab/Qwen3.6-27B-DFlash --local-dir ./qwen36-27b-dflash

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp-xs \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 200000 \
  --max-num-seqs 16 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --attention-backend flash_attn \
  --speculative-config '{"method":"dflash","model":"./qwen36-27b-dflash","num_speculative_tokens":15}'
```

Production-validated v3 image: [`ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3`](https://github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash/pkgs/container/vllm-aeon-ultimate-dflash). Measured **38.1 tok/s median, 68.4 tok/s peak** thinking-off and **38.5 / 71.3** thinking-on — the highest single-stream config we've measured on Spark.

### Configuration notes

- **`--quantization modelopt`** is required for this body (not `compressed-tensors` — different format).
- **`--speculative-config '{"method":"qwen3_5_mtp", ...}'`** uses the grafted MTP head; correct for **dedicated-VRAM Blackwell**. Don't use this on DGX Spark.
- **`--speculative-config '{"method":"dflash", ...}'`** uses an external DFlash drafter; correct for **DGX Spark**. The grafted MTP head in this repo sits unused in this path (~0.85 GB dead weight). Don't use this on RTX PRO 6000 or B100/B200 — they prefer MTP.
- **`--gpu-memory-utilization 0.94`** is the validated cap on RTX PRO 6000; `0.85` is the cap on DGX Spark (unified memory thrashes higher).

## Quantization recipe

- **Tool**: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- **Loader**: `Qwen3_5ForConditionalGeneration.from_pretrained` (multimodal-preserved class)
- **Calibration**: `neuralmagic/calibration` LLM split, 20 samples × 8192 tokens
- **Excluded from quantization (kept BF16)** — XS variant differences from the regular variant in **bold**:
  - `lm_head`, `proj_out.*`, `*router*`, `*mlp.gate.*` (NVFP4_DEFAULT_CFG)
  - **`*linear_attn.conv1d*`, `*mixer.conv1d*`** *(NVFP4_DEFAULT_CFG default — kept BF16 because FP4 quantization of the SSM 1D convolution causes drift on long-context recurrence; this is the recurrence-critical kernel of the GatedDeltaNet block. **Both regular and XS variants preserve this.**)*
  - **`*linear_attn*` is NOT broadly excluded** (XS difference — the projection matmuls `in_proj_qkv`, `in_proj_z`, `in_proj_a/b`, `out_proj` get NVFP4-quantized; saves ~8 GB; FP4 is a clean win on bandwidth-bound matmuls)
  - `*visual*` (vision tower preservation)
  - `*mtp*` (MTP head preservation)
  - `*output_layer*`, `output.*`
- **MTP graft**: 15 tensors copied bf16 from `Qwen/Qwen3.6-27B` after modelopt export
- **Pipeline**: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source

## Provenance & credits

- **BF16 source**: [`AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16`](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16). See that card for the full abliteration pipeline.
- **MTP graft technique**: [lna-lab/GGUF-to-NVFP4-SM120](https://github.com/lna-lab/GGUF-to-NVFP4-SM120) (`docs/MTP_GRAFT_RECIPE.md`)
- **Reference benchmark recipes**: [`sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP`](https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP)
- **Quantization**: NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.43.0)
- **Base**: Alibaba Qwen team — `Qwen/Qwen3.6-27B`

## License + responsibility

Apache 2.0, inherited from `Qwen/Qwen3.6-27B`. **This is an uncensored model.** Read the full [User Responsibility & Arbitration Clause](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16#user-responsibility--arbitration-clause) on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.

---

## ☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

<table align="center">
  <tr>
    <td align="center" width="50%">
      <strong>₿ Bitcoin (BTC)</strong><br/>
      <img src="https://raw.githubusercontent.com/AEON-7/AEON-7/main/assets/qr/btc.png" alt="BTC QR" width="200"/><br/>
      <sub><code>bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4</code></sub>
    </td>
    <td align="center" width="50%">
      <strong>Ξ Ethereum (ETH)</strong><br/>
      <img src="https://raw.githubusercontent.com/AEON-7/AEON-7/main/assets/qr/eth.png" alt="ETH QR" width="200"/><br/>
      <sub><code>0x1512667F6D61454ad531d2E45C0a5d1fd82D0500</code></sub>
    </td>
  </tr>
  <tr>
    <td align="center" width="50%">
      <strong>◎ Solana (SOL)</strong><br/>
      <img src="https://raw.githubusercontent.com/AEON-7/AEON-7/main/assets/qr/sol.png" alt="SOL QR" width="200"/><br/>
      <sub><code>DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t</code></sub>
    </td>
    <td align="center" width="50%">
      <strong>ⓜ Monero (XMR)</strong><br/>
      <img src="https://raw.githubusercontent.com/AEON-7/AEON-7/main/assets/qr/xmr.png" alt="XMR QR" width="200"/><br/>
      <sub><code>836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd</code></sub>
    </td>
  </tr>
</table>

> **Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens** can be sent to the same Ethereum address.