---
language:
- en
license: gemma
base_model:
- google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant
library_name: mlx
pipeline_tag: text-generation
tags:
- mlx
- mlx-vlm
- gemma4_assistant
- gemma-4
- gemma-4-e4b
- gemma
- mtp
- speculative-decoding
- draft-model
- 4bit
inference: false
---

# gemma-4-E4B-it-qat-assistant-4bit

This repository contains Multi-Token Prediction (MTP) drafter weights split from `google/gemma-4-E4B-it-qat-q4_0-unquantized-assistant` for use with `mlx-vlm` speculative decoding.

This is not a standalone chat or text-generation model. Load it as the draft model alongside a compatible Gemma 4 E4B target checkpoint.

## Use with mlx-vlm

```bash
uv run mlx_vlm.generate \
  --model google/gemma-4-E4B-it \
  --draft-model mlx-community/gemma-4-E4B-it-qat-assistant-4bit \
  --draft-kind mtp \
  --prompt "Describe this image." \
  --max-tokens 256
```

For local weights:

```bash
uv run mlx_vlm.generate \
  --model /path/to/target-model \
  --draft-model /path/to/gemma-4-E4B-mtp \
  --draft-kind mtp \
  --prompt "Describe this image." \
  --max-tokens 256
```

## Model Details

- Model type: `gemma4_assistant`
- Target architecture: Gemma 4 E4B
- Precision: 4bit
- Runtime: MLX / `mlx-vlm`
- Format: Safetensors with MLX-compatible config and tokenizer files

The stored tensors are 4bit MLX-compatible drafter weights.

## Intended Use

Use this repo only as a speculative decoding drafter for compatible Gemma 4 E4B checkpoints. The target model verifies drafted tokens, while this MTP model proposes candidate tokens per decoding step.

## Limitations

This checkpoint requires runtime support for Gemma 4 MTP draft models in `mlx-vlm`. Standard standalone generation through generic Transformers APIs is not expected to work with this repository by itself.

Please refer to the upstream `google/gemma-4-E4B-it` model card and license terms for model usage constraints.
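
As a rough illustration of the draft-and-verify split described under Intended Use, the toy sketch below shows one way greedy speculative decoding can be organized: a drafter proposes a few tokens, the target keeps the longest prefix it agrees with, and the target always contributes the next token itself. This is not the `mlx-vlm` implementation. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, the "models" are plain Python functions over integer tokens, and greedy (argmax) decoding is assumed. A real runtime works on logits and caches and verifies all drafted positions in a batched pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-and-verify step and return the emitted tokens (at least one)."""
    # 1. The drafter proposes k candidate tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target compares each candidate with its own greedy choice.
    emitted: List[int] = []
    for candidate in proposed:
        expected = target(prefix + emitted)
        if candidate != expected:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(candidate)

    # 3. All candidates were accepted, so the target adds one more token.
    emitted.append(target(prefix + emitted))
    return emitted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative steps until max_new_tokens tokens have been added."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    # A step may overshoot the budget, so trim to the requested length.
    return out[: len(prefix) + max_new_tokens]


def toy_target(seq: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy drafter: agrees with the target except right after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

Calling `generate([0], toy_draft, toy_target, k=3, max_new_tokens=12)` yields the same sequence as decoding with `toy_target` alone, because every emitted token is either confirmed or produced by the target. In this scheme the drafter affects only how many tokens each step yields, not which tokens are produced.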