---
language:
- en
license: gemma
base_model:
- google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant
library_name: mlx
pipeline_tag: text-generation
tags:
- mlx
- mlx-vlm
- gemma4_assistant
- gemma-4
- gemma-4-e4b
- gemma
- mtp
- speculative-decoding
- draft-model
- 4bit
inference: false
---

# gemma-4-E4B-it-qat-assistant-4bit

This repository contains Multi-Token Prediction (MTP) drafter weights split from `google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant` for use with `mlx-vlm` speculative decoding.

This is not a standalone chat or text-generation model. Load it as the draft model alongside a compatible Gemma 4 E4B target checkpoint.

## Use with mlx-vlm

```bash
uv run mlx_vlm.generate \
  --model google/gemma-4-E4B-it \
  --draft-model mlx-community/gemma-4-E4B-it-qat-assistant-4bit \
  --draft-kind mtp \
  --prompt "Describe this image." \
  --max-tokens 256
```

For local weights:

```bash
uv run mlx_vlm.generate \
  --model /path/to/target-model \
  --draft-model /path/to/gemma-4-E4B-mtp \
  --draft-kind mtp \
  --prompt "Describe this image." \
  --max-tokens 256
```

## Model Details

- Model type: `gemma4_assistant`
- Target architecture: Gemma 4 E4B
- Precision: 4-bit
- Runtime: MLX / `mlx-vlm`
- Format: Safetensors with MLX-compatible config and tokenizer files

The stored tensors are 4-bit, MLX-compatible drafter weights.

## Intended Use

Use this repo only as a speculative-decoding drafter for compatible Gemma 4 E4B checkpoints. At each decoding step, this MTP model proposes candidate tokens and the target model verifies them.
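
To make the draft-then-verify idea concrete, here is a minimal toy sketch of one greedy speculative decoding step in plain Python. It is not the `mlx-vlm` implementation and does not load this checkpoint: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and both "models" are simple functions that return a next token for a given token list. A real runtime would typically verify all drafted tokens in a single batched target forward pass rather than one call per token.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one greedy draft-then-verify step and return the tokens to append."""
    # 1. The drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The target checks each proposal against its own greedy choice.
    accepted: List[int] = []
    ctx = list(tokens)
    for t in proposed:
        expected = target(ctx)
        if t != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3. Every proposal matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def toy_target(tokens: List[int]) -> int:
    # Toy target (assumption): counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Toy drafter (assumption): agrees with the target except after token 3.
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [1], k=4))
    print(speculative_step(toy_target, toy_draft, [5], k=3))
```

In this greedy variant the output always matches what the target alone would have produced; the drafter only changes how many tokens are confirmed per step.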

## Limitations

This checkpoint requires an `mlx-vlm` version with runtime support for Gemma 4 MTP draft models. Standalone generation through generic Transformers APIs is not expected to work with this repository alone.

Please refer to the upstream `google/gemma-4-E4B-it` model card and license terms for model usage constraints.