---
license: apache-2.0
library_name: mlx
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
pipeline_tag: image-text-to-text
tags:
- qwen
- qwen3_5
- dense
- hybrid-attention
- gated-deltanet
- mlx
- apple-silicon
- quantized
- mxfp8
- vision
- multimodal
- video
- multi-token-prediction
- speculative-decoding
- jang
- osaurus
quantization_config:
  family: mxfp8
  profile: MXFP8
  group_size: 32
  bits: 8
---

# Qwen3.6-27B-MXFP8-MTP

**Qwen3.6-27B** (dense) quantized to native MXFP8 for Apple Silicon, with the
vision tower and the native Multi-Token-Prediction head preserved and enabled.

| | |
|---|---|
| Source | [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP8 (`mx.quantize`, `mode="mxfp8"`, `group_size=32`) |
| Architecture | `qwen3_5` dense — 64 layers, hybrid GatedDeltaNet + full attention, hidden 5120 |
| Modality | image + video + text |
| Context | 262,144 |
| Bundle size | 27.1 GB |
| MTP | native head preserved, **enabled** (`num_nextn_predict_layers=1`) |

## Quantization

Linear layers are quantized to 8 bits with MLX-native `mx.quantize`
(`mode="mxfp8"`, `group_size=32`). Norms, hybrid-attention control tensors,
and the full vision tower are passed through in fp16 (693 passthrough
tensors). The MTP head's linear layers are also quantized to MXFP8; its norm
and control tensors stay in fp16.
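
As a rough aid for reasoning about storage cost, the sketch below computes the effective bits per quantized weight when each group of weights shares one scale. The 8-bit shared scale is an assumption taken from the general MX block format, not a figure stated in this card, and `effective_bits_per_weight` is a hypothetical helper. The fp16 passthrough tensors are not covered; they cost 16 bits per value.

```python
def effective_bits_per_weight(element_bits: int, scale_bits: int, group_size: int) -> float:
    """Bits stored per weight when every group of `group_size` weights shares one scale."""
    return element_bits + scale_bits / group_size


# 8-bit elements and group_size=32 come from this card.
# scale_bits=8 is an ASSUMPTION (one shared 8-bit scale per group, as in the MX block format).
mxfp8_bits = effective_bits_per_weight(element_bits=8, scale_bits=8, group_size=32)
print(mxfp8_bits)  # 8.25
```

Under that assumption the per-group scale adds a quarter of a bit per weight on top of the 8-bit elements.
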

## Multi-Token Prediction

This bundle keeps Qwen3.6's native MTP module and runs it as a
**self-speculative draft head**: the MTP head proposes tokens, and the main
model verifies them in a single pass. Only tokens the main model would have
produced itself are kept, so decoded output stays **bit-identical to plain
autoregressive decoding** while running faster.
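
The draft-then-verify loop described above can be illustrated with a toy sketch. This is not the vMLX implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the main model and the MTP head, decoding is greedy, and verification is written as a Python loop where a real runtime would use one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, depth: int) -> List[int]:
    """Draft `depth` tokens, keep the prefix the target agrees with, repeat."""
    seq = list(prompt)
    end = len(prompt) + n_new
    while len(seq) < end:
        # The draft head proposes `depth` tokens autoregressively.
        proposal: List[int] = []
        for _ in range(depth):
            proposal.append(draft(seq + proposal))
        # The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: take the target's token and stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target's pass also yields one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):end]


def toy_target(seq: List[int]) -> int:
    """Hypothetical deterministic 'main model'."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    """Hypothetical draft head: agrees with the target except at every fourth position."""
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 4 == 0 else guess


baseline = greedy_decode(toy_target, [1, 2, 3], 20)
for depth in (1, 2, 3):
    assert speculative_decode(toy_target, toy_draft, [1, 2, 3], 20, depth) == baseline
```

Because every kept token is one the target itself would have chosen for that prefix, the output matches the baseline at any draft depth; the draft only changes how many target passes are needed.
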

Recorded on an **M5 Max** (vMLX runtime, 96-token deterministic prompt,
output verified equal to baseline at every depth):

| Draft depth | tok/s | Speedup |
|---|---|---|
| Baseline (MTP off) | 15.8 | 1.00× |
| D1 | 24.7 | 1.56× |
| D2 | 28.8 | 1.82× |
| **D3 (default)** | **28.9** | **1.83×** |

With vMLX prefix/KV cache layers enabled the speedup holds: a recorded
cache-on A/B run measured 15.5 → 28.2 tok/s (1.81×).

> Absolute tok/s depends on free memory and system load. The **speedup
> ratio** — baseline vs. MTP measured back-to-back under identical
> conditions — is the stable figure.
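
The speedup column is simply MTP throughput divided by baseline throughput from the same run. A minimal sketch that recomputes it from the tok/s figures in the table above (the variable and function names are illustrative):

```python
baseline_tok_s = 15.8  # MTP off, from the table above
mtp_tok_s = {"D1": 24.7, "D2": 28.8, "D3": 28.9}  # tok/s per draft depth, from the table above


def speedup(mtp: float, baseline: float) -> float:
    """Throughput ratio of an MTP run against its back-to-back baseline."""
    return mtp / baseline


speedups = {depth: round(speedup(tok_s, baseline_tok_s), 2) for depth, tok_s in mtp_tok_s.items()}
print(speedups)  # {'D1': 1.56, 'D2': 1.82, 'D3': 1.83}
```

To compare on other hardware, measure both numbers back-to-back on the same machine and prompt before taking the ratio.
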

## Vision, MTP and caching together

These bundles run image/video input, native MTP speculative decoding, and
prefix/KV caching in the same session, a combination not every MTP-enabled
Qwen build exposes. A recorded VL probe (2026-05-16) confirmed that a
color-identification image prompt returns the correct answer through the
combined MTP + VL runtime.

## Loading

Loads with stock MLX tooling on Apple Silicon. The `mxfp8` weights use the
native `mx.quantize` format, so the core model needs no JANG runtime.

```python
from mlx_vlm import load, generate
model, processor = load("OsaurusAI/Qwen3.6-27B-MXFP8-MTP")
```
The MTP draft path is exercised by an MTP-aware runtime (vMLX); other
runtimes load and decode the main model normally and ignore the MTP head.
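
Before pointing an MTP-aware runtime at a bundle, it can help to check whether its config declares an MTP head. The sketch below reads `num_nextn_predict_layers`, the field named in the summary table. Where that key sits in `config.json` (top level or under a nested `text_config`) is an assumption, and `mtp_layers` and the `example` dict are hypothetical illustrations, not this bundle's actual config.

```python
import json
from pathlib import Path


def mtp_layers(config: dict) -> int:
    """Return the declared number of MTP layers, or 0 if the key is absent.

    ASSUMPTION: the key lives at the top level or under "text_config".
    """
    for section in (config, config.get("text_config") or {}):
        value = section.get("num_nextn_predict_layers")
        if value is not None:
            return int(value)
    return 0


def mtp_layers_from_file(path: str) -> int:
    """Convenience wrapper for a config.json on disk."""
    return mtp_layers(json.loads(Path(path).read_text()))


# Illustrative config only; not copied from the real bundle.
example = {"model_type": "qwen3_5", "text_config": {"num_nextn_predict_layers": 1}}
print(mtp_layers(example))  # 1
```

A result of 0 only means the key was not found in the assumed locations; it does not prove the weights lack an MTP head.
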
## Variants
| Variant | Arch | Format | Size | Best MTP speedup |
|---|---|---|---|---|
| [Qwen3.6-27B-MXFP4-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-27B-MXFP4-MTP) | dense | mxfp4 | 14.4 GB | 1.85× (D2) |
| **Qwen3.6-27B-MXFP8-MTP** (this) | dense | mxfp8 | 27.1 GB | **1.83× (D3)** |
| [Qwen3.6-35B-A3B-MXFP4-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP) | MoE | mxfp4 | 21.5 GB | 1.56× (D3) |
| [Qwen3.6-35B-A3B-MXFP8-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-35B-A3B-MXFP8-MTP) | MoE | mxfp8 | 35.0 GB | 1.71× (D3) |
## Credits
- **Quantization toolchain:** [JANG](https://github.com/jangq-ai/jang) by Jinho Jang <eric@osaurus.ai>
- **Base model:** Qwen3.6-27B by [Qwen](https://huggingface.co/Qwen)