---
language:
- en
license: apache-2.0
base_model:
- Qwen/Qwen3.5-9B
library_name: mlx
pipeline_tag: text-generation
tags:
- mlx
- mlx-vlm
- qwen3_5_mtp
- qwen3.5
- qwen3.5-9b
- qwen
- mtp
- speculative-decoding
- draft-model
- 4-bit
inference: false
---

# Qwen3.5-9B-MTP-4bit

This repository contains Multi-Token Prediction (MTP) drafter weights split from `Qwen/Qwen3.5-9B` for use with `mlx-vlm` speculative decoding.

This is not a standalone chat or text-generation model. Load it as the draft model alongside a compatible Qwen3.5 9B target checkpoint.

## Use with mlx-vlm

```bash
uv run mlx_vlm.generate \
  --model mlx-community/Qwen3.5-9B-5bit \
  --draft-model mlx-community/Qwen3.5-9B-MTP-4bit \
  --prompt "Hi, how are you?" \
  --max-tokens 256 \
  --enable-thinking
```

For local weights:

```bash
uv run mlx_vlm.generate \
  --model /path/to/target-model \
  --draft-model /path/to/Qwen3.5-9B-MTP-4bit \
  --prompt "Hi, how are you?" \
  --max-tokens 256 \
  --enable-thinking
```

## Model Details

- Model type: `qwen3_5_mtp`
- MTP block size: `2`
- Target architecture: Qwen3.5 9B
- Precision: MLX affine 4-bit, group size 64
- Runtime: MLX / `mlx-vlm`
- Format: Safetensors with MLX-compatible config and tokenizer files

The stored tensors use MLX affine 4-bit quantization as described in `config.json`.
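To give a feel for what "affine 4-bit, group size 64" means, the following is a minimal pure-Python sketch of affine quantization for a single group of weights. It is an illustration only, not MLX's actual kernel or packing code. The function names are hypothetical, and the 16-bit width assumed for the per-group scale and bias is an assumption, not something stated in this repository.

```python
def affine_quantize(group, bits=4):
    """Map one group of floats to unsigned integer codes in [0, 2**bits - 1]."""
    lo, hi = min(group), max(group)
    levels = (1 << bits) - 1               # 15 distinct steps for 4-bit
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in group]
    return codes, scale, lo                # lo acts as the per-group bias


def affine_dequantize(codes, scale, bias):
    """Reconstruct approximate weights: w_hat = scale * code + bias."""
    return [scale * q + bias for q in codes]


def bits_per_weight(bits=4, group_size=64, param_bits=16):
    """Effective storage cost, assuming one scale and one bias per group.

    param_bits=16 is an assumption (half-precision scale and bias).
    """
    return bits + 2 * param_bits / group_size


# Example: one group of 64 evenly spaced weights
group = [i / 63 - 0.5 for i in range(64)]
codes, scale, bias = affine_quantize(group)
restored = affine_dequantize(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Under these assumptions the reconstruction error per weight is bounded by half a quantization step (`scale / 2`), and the effective storage cost is slightly above 4 bits per weight because each group of 64 also stores its own scale and bias.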

## Intended Use

Use this repository only as a speculative decoding drafter for compatible Qwen3.5 9B checkpoints. At each decoding step, this MTP model proposes candidate tokens and the target model verifies them before they are accepted.
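The draft-and-verify division of labor can be sketched with a toy greedy example. This is a conceptual illustration, not how `mlx-vlm` implements MTP drafting: `target_next`, `draft_next`, `speculative_step`, and `num_draft` are hypothetical names, both "models" are stand-in functions over integer tokens, and `num_draft` is not claimed to correspond to the MTP block size listed above.

```python
def speculative_step(target_next, draft_next, tokens, num_draft=2):
    """Run one greedy draft-and-verify step and return the accepted tokens."""
    # Draft phase: the cheap model proposes num_draft tokens autoregressively.
    ctx = list(tokens)
    proposed = []
    for _ in range(num_draft):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Verify phase: the target checks each proposal in order. A real runtime
    # would typically score all proposals in one batched forward pass.
    ctx = list(tokens)
    accepted = []
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    accepted.append(target_next(ctx))   # all accepted: keep the target's next token too
    return accepted


def generate(target_next, draft_next, prompt, max_new_tokens):
    """Repeat speculative steps until max_new_tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(speculative_step(target_next, draft_next, tokens))
    return tokens[: len(prompt) + max_new_tokens]


# Toy stand-ins: the target counts upward modulo 10; the draft agrees
# everywhere except after token 3, where it guesses wrong.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


output = generate(target_next, draft_next, [0], max_new_tokens=8)
```

In this greedy setting the output matches what the target would produce on its own. The drafter only changes how many tokens are accepted per step, which is why draft quality affects speed rather than the final text.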

## Limitations

This checkpoint requires an `mlx-vlm` build with runtime support for Qwen/DeepSeek MTP draft models. Loading this repository on its own through generic Transformers APIs for standalone generation is not expected to work.

Please refer to the upstream `Qwen/Qwen3.5-9B` model card and license terms for model usage constraints.