---
license: apache-2.0
library_name: mlx
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
pipeline_tag: image-text-to-text
tags:
  - qwen
  - qwen3_5
  - dense
  - hybrid-attention
  - gated-deltanet
  - mlx
  - apple-silicon
  - quantized
  - mxfp8
  - vision
  - multimodal
  - video
  - multi-token-prediction
  - speculative-decoding
  - jang
  - osaurus
quantization_config:
  family: mxfp8
  profile: MXFP8
  group_size: 32
  bits: 8
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# Qwen3.6-27B-MXFP8-MTP

**Qwen3.6-27B** (dense) quantized to native MXFP8 for Apple Silicon, with the
vision tower and the native Multi-Token-Prediction head preserved and enabled.

| | |
|---|---|
| Source | [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP8 (`mx.quantize`, `mode="mxfp8"`, `group_size=32`) |
| Architecture | `qwen3_5` dense — 64 layers, hybrid GatedDeltaNet + full attention, hidden 5120 |
| Modality | image + video + text |
| Context | 262,144 |
| Bundle size | 27.1 GB |
| MTP | native head preserved, **enabled** (`num_nextn_predict_layers=1`) |

## Quantization

Linear layers are quantized to 8 bits with MLX-native `mx.quantize`
(`mode="mxfp8"`, `group_size=32`). Norms, hybrid-attention control tensors
and the full vision tower are kept in fp16 passthrough (693 passthrough
tensors). MTP linears are quantized to MXFP8; MTP norm/control tensors stay
fp16.
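As a rough guide to what `group_size=32` costs in storage, the sketch below computes the effective bits per quantized weight. It assumes one shared 8-bit scale per group, as in the OCP microscaling (MX) formats; the helper names are hypothetical, and the fp16 passthrough tensors are not counted, so this is not a reconstruction of the bundle size.

```python
def effective_bits_per_weight(bits: int = 8, group_size: int = 32,
                              scale_bits: int = 8) -> float:
    """Element bits plus the shared per-group scale, amortized over the group.

    Assumption: exactly one `scale_bits`-wide scale is stored per group.
    """
    return bits + scale_bits / group_size


def quantized_gigabytes(num_weights: int, bits: int = 8, group_size: int = 32,
                        scale_bits: int = 8) -> float:
    """Approximate size in GB (10^9 bytes) of `num_weights` quantized weights."""
    total_bits = num_weights * effective_bits_per_weight(bits, group_size, scale_bits)
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    print(effective_bits_per_weight())          # bits per weight with the defaults
    print(quantized_gigabytes(1_000_000_000))   # GB for a hypothetical 1e9 weights
```

With the defaults, the scale adds 8/32 = 0.25 bits per weight on top of the 8 element bits.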

## Multi-Token Prediction

This bundle keeps Qwen3.6's native MTP module and runs it as a
**self-speculative draft head**: the MTP head proposes tokens that the main
model verifies in a single pass, so decoded output stays **bit-identical to
plain autoregressive decoding** — only faster.
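The draft-and-verify idea can be illustrated with a toy greedy loop. This is a minimal sketch, not the vMLX implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main model and the MTP head, and the verification "pass" is simulated with a Python loop where a real runtime would score all drafted positions in one batched forward pass.

```python
from typing import List, Tuple


def target_next(tokens: List[int]) -> int:
    """Toy main model: a deterministic function of the full context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens: List[int]) -> int:
    """Toy draft head: matches the target except when len(tokens) % 4 == 0."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt: List[int], n_new: int, depth: int) -> Tuple[List[int], int]:
    """Draft `depth` tokens, keep the verified prefix plus one target token."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1) Draft: the cheap head proposes `depth` tokens autoregressively.
        draft: List[int] = []
        for _ in range(depth):
            draft.append(draft_next(tokens + draft))
        # 2) Verify: counted as a single target pass over depth + 1 positions.
        target_passes += 1
        accepted: List[int] = []
        for i in range(depth + 1):
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always the target's own token
            if i == depth or draft[i] != expected:
                break  # bonus token reached, or first mismatch corrected
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + n_new], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, 40)
    output, passes = speculative_decode(prompt, 40, depth=3)
    print("identical to greedy:", output == baseline)
    print("target passes:", passes, "vs", len(baseline), "for plain decoding")
```

Every emitted token is the target's own choice for its prefix, which is why the output matches plain greedy decoding; the gain comes from needing fewer target passes when drafts are accepted.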

Recorded on an **M5 Max** (vMLX runtime, 96-token deterministic prompt,
output verified equal to baseline at every depth). Draft depth D*n* means
the MTP head proposes *n* tokens per verification pass:

| Draft depth | tok/s | Speedup |
|---|---|---|
| Baseline (MTP off) | 15.8 | 1.00× |
| D1 | 24.7 | 1.56× |
| D2 | 28.8 | 1.82× |
| **D3 (default)** | **28.9** | **1.83×** |

With vMLX prefix/KV cache layers enabled the speedup holds — a recorded
cache-on A/B measured 15.5 → 28.2 tok/s (1.81×).

> Absolute tok/s depends on free memory and system load. The **speedup
> ratio** — baseline vs. MTP measured back-to-back under identical
> conditions — is the stable figure.
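The speedup column is simply each MTP throughput divided by the baseline throughput. A short check using the figures from the table above (the variable and function names are illustrative only):

```python
# Throughput figures copied from the benchmark table (tok/s).
BASELINE_TOK_S = 15.8
MTP_TOK_S = {"D1": 24.7, "D2": 28.8, "D3": 28.9}


def speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Ratio of MTP throughput to baseline throughput."""
    return tok_s / baseline


# Round to two decimals, matching the precision used in the table.
speedups = {depth: round(speedup(rate), 2) for depth, rate in MTP_TOK_S.items()}
best = max(speedups, key=speedups.get)

for depth, ratio in speedups.items():
    print(f"{depth}: {ratio:.2f}x")   # D1: 1.56x, D2: 1.82x, D3: 1.83x
print("best depth:", best)            # best depth: D3
```

To compare your own hardware against these numbers, measure both figures in the same session and compare the ratio rather than the absolute tok/s.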

## Vision, MTP and caching together

This bundle runs image/video input, native MTP speculative decode and
prefix/KV caching in the same session, a combination not every MTP-enabled
Qwen build exposes. A recorded VL probe (2026-05-16) confirms that a
color-identification image prompt returns the correct answer through the
combined MTP + VL runtime.

## Loading

The bundle loads with stock MLX tooling on Apple Silicon: the `mxfp8`
weights use the native `mx.quantize` format, so the core model needs no
JANG runtime.

```python
from mlx_vlm import load, generate

# Fetches the bundle from the Hugging Face Hub if needed, then loads the model and its processor.
model, processor = load("OsaurusAI/Qwen3.6-27B-MXFP8-MTP")
```

The MTP draft path is exercised by an MTP-aware runtime (vMLX); other
runtimes load and decode the main model normally and ignore the MTP head.
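A runtime or script that wants to know whether a bundle carries an MTP head could inspect its `config.json` for the `num_nextn_predict_layers` field listed in the summary table. The sketch below is illustrative only: the helper names are hypothetical, and the field's location (top level or under a nested `text_config`) is an assumption that this card does not confirm.

```python
import json
from pathlib import Path
from typing import Any, Dict


def read_bundle_config(bundle_dir: str) -> Dict[str, Any]:
    """Read config.json from a locally downloaded bundle directory."""
    return json.loads((Path(bundle_dir) / "config.json").read_text())


def mtp_draft_layers(config: Dict[str, Any]) -> int:
    """Return the number of MTP layers advertised by the config, or 0.

    Assumption: the field is named `num_nextn_predict_layers` and lives
    either at the top level or inside a nested `text_config` section.
    """
    key = "num_nextn_predict_layers"
    for section in (config, config.get("text_config") or {}):
        value = section.get(key)
        if isinstance(value, int) and value > 0:
            return value
    return 0


if __name__ == "__main__":
    example = {"model_type": "qwen3_5", "num_nextn_predict_layers": 1}
    layers = mtp_draft_layers(example)
    print("MTP available" if layers else "MTP head not advertised", layers)
```

A runtime without MTP support would skip this check and decode with the main model only.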

## Variants

| Variant | Arch | Format | Size | Best MTP speedup |
|---|---|---|---|---|
| [Qwen3.6-27B-MXFP4-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-27B-MXFP4-MTP) | dense | mxfp4 | 14.4 GB | 1.85× (D2) |
| **Qwen3.6-27B-MXFP8-MTP** (this) | dense | mxfp8 | 27.1 GB | **1.83× (D3)** |
| [Qwen3.6-35B-A3B-MXFP4-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP) | MoE | mxfp4 | 21.5 GB | 1.56× (D3) |
| [Qwen3.6-35B-A3B-MXFP8-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-35B-A3B-MXFP8-MTP) | MoE | mxfp8 | 35.0 GB | 1.71× (D3) |

## Credits

- **Quantization toolchain:** [JANG](https://github.com/jangq-ai/jang) by Jinho Jang &lt;eric@osaurus.ai&gt;
- **Base model:** Qwen3.6-27B by [Qwen](https://huggingface.co/Qwen)