---
license: apache-2.0
library_name: mlx
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
pipeline_tag: image-text-to-text
tags:
- qwen
- qwen3_5
- dense
- hybrid-attention
- gated-deltanet
- mlx
- apple-silicon
- quantized
- mxfp8
- vision
- multimodal
- video
- multi-token-prediction
- speculative-decoding
- jang
- osaurus
quantization_config:
  family: mxfp8
  profile: MXFP8
  group_size: 32
  bits: 8
---

OsaurusAI

# Qwen3.6-27B-MXFP8-MTP

**Qwen3.6-27B** (dense) quantized to native MXFP8 for Apple Silicon, with the vision tower and the native Multi-Token-Prediction head preserved and enabled.

| | |
|---|---|
| Source | [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP8 (`mx.quantize`, affine, `group_size=32`) |
| Architecture | `qwen3_5` dense: 64 layers, hybrid GatedDeltaNet + full attention, hidden 5120 |
| Modality | image + video + text |
| Context | 262,144 |
| Bundle size | 27.1 GB |
| MTP | native head preserved, **enabled** (`num_nextn_predict_layers=1`) |

## Quantization

8-bit affine linears via MLX-native `mx.quantize` (`mode="mxfp8"`, `group_size=32`). Norms, hybrid-attention control tensors and the full vision tower are kept in fp16 passthrough (693 passthrough tensors). MTP linears are quantized to MXFP8; MTP norm/control tensors stay fp16.

## Multi-Token Prediction

This bundle keeps Qwen3.6's native MTP module and runs it as a **self-speculative draft head**: the MTP head proposes tokens that the main model verifies in a single pass, so decoded output stays **bit-identical to plain autoregressive decoding**, only faster.

Recorded on an **M5 Max** (vMLX runtime, 96-token deterministic prompt, output verified equal to baseline at every depth):

| Draft depth | tok/s | Speedup |
|---|---|---|
| Baseline (MTP off) | 15.8 | 1.00× |
| D1 | 24.7 | 1.56× |
| D2 | 28.8 | 1.82× |
| **D3 (default)** | **28.9** | **1.83×** |

With vMLX prefix/KV cache layers enabled the speedup holds: a recorded cache-on A/B measured 15.5 → 28.2 tok/s (1.81×).

> Absolute tok/s depends on free memory and system load. The **speedup
> ratio** (baseline vs. MTP, measured back-to-back under identical
> conditions) is the stable figure.

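
The draft-and-verify loop described in this section can be illustrated with a small toy sketch. It is not the vMLX implementation and uses no MLX API: `main_next`, `draft_next`, `greedy_decode` and `speculative_decode` are hypothetical names, and the "models" are plain deterministic functions standing in for the main network and the MTP head. The point it shows is why greedy speculative decoding reproduces the baseline output exactly: every emitted token is the main model's own choice, and the draft only determines how many tokens each verification pass yields.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def main_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the main model's greedy next token (assumption)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the MTP head: deliberately wrong on every 4th position."""
    guess = main_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(prompt: List[Token], n_new: int,
                  target: NextFn = main_next) -> List[Token]:
    """Plain autoregressive decoding: one target call per emitted token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(prompt: List[Token], n_new: int, depth: int,
                       target: NextFn = main_next,
                       draft: NextFn = draft_next) -> Tuple[List[Token], int]:
    """Draft `depth` tokens, then keep the target's tokens up to the first mismatch."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # 1. The draft head proposes `depth` tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(depth):
            proposal.append(draft(out + proposal))

        # 2. Verification. A real runtime scores all positions in one forward
        #    pass; this loop emulates that pass position by position.
        verify_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)      # always emit the target's token
            if tok != expected:
                break                      # later drafts are discarded
        else:
            # Every draft matched, so the pass also yields one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[len(prompt):goal], verify_passes


if __name__ == "__main__":
    baseline = greedy_decode([1, 2, 3], 40)
    tokens, passes = speculative_decode([1, 2, 3], 40, depth=3)
    print("identical to baseline:", tokens == baseline)
    print("verification passes:", passes, "for", len(tokens), "tokens")
```

In this toy, `depth` plays the role of the D1/D2/D3 draft depth in the table above: a deeper draft can yield more tokens per verification pass, but only while the draft keeps agreeing with the main model.
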
## Vision, MTP and caching together

These bundles run image/video input, native MTP speculative decode and prefix/KV caching in the same session, a combination not every MTP-enabled Qwen build exposes. A recorded VL probe (2026-05-16) confirms that a color-identification image prompt returns the correct answer through the combined MTP + VL runtime.

## Loading

Loads via stock MLX tooling on Apple Silicon: the `mxfp8` weights are native `mx.quantize` affine, so no JANG runtime is required for the core model.

```python
from mlx_vlm import load, generate

# Fetches the bundle from the Hub (or reuses the local cache) and returns the model and its processor.
model, processor = load("OsaurusAI/Qwen3.6-27B-MXFP8-MTP")
```

The MTP draft path is exercised by an MTP-aware runtime (vMLX); other runtimes load and decode the main model normally and ignore the MTP head.

## Variants

| Variant | Arch | Format | Size | Best MTP speedup |
|---|---|---|---|---|
| [Qwen3.6-27B-MXFP4-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-27B-MXFP4-MTP) | dense | mxfp4 | 14.4 GB | 1.85× (D2) |
| **Qwen3.6-27B-MXFP8-MTP** (this) | dense | mxfp8 | 27.1 GB | **1.83× (D3)** |
| [Qwen3.6-35B-A3B-MXFP4-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP) | MoE | mxfp4 | 21.5 GB | 1.56× (D3) |
| [Qwen3.6-35B-A3B-MXFP8-MTP](https://huggingface.co/OsaurusAI/Qwen3.6-35B-A3B-MXFP8-MTP) | MoE | mxfp8 | 35.0 GB | 1.71× (D3) |

## Credits

- **Quantization toolchain:** [JANG](https://github.com/jangq-ai/jang) by Jinho Jang <eric@osaurus.ai>
- **Base model:** Qwen3.6-27B by [Qwen](https://huggingface.co/Qwen)