prince-canuma committed
Commit 83f9511 · verified · 1 Parent(s): 570264b

Add MTP drafter model card

Files changed (1)
  1. README.md +57 -7
README.md CHANGED
@@ -1,19 +1,69 @@
  ---
- language: en
  tags:
  - mlx
- library_name: mlx
- pipeline_tag: image-text-to-text
  ---

- # mlx-community/gemma-4-12B-it-qat-assistant-nvfp4

- ## Use with mlx

  ```bash
- pip install -U mlx-vlm
  ```

  ```bash
- python -m mlx_vlm.generate --model mlx-community/gemma-4-12B-it-qat-assistant-nvfp4 --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image <path_to_image>
  ```

  ---
+ language:
+ - en
+ license: gemma
+ base_model:
+ - google/gemma-4-12B-it-qat-q4_0-unquantized-assistant
+ library_name: mlx
+ pipeline_tag: text-generation
  tags:
  - mlx
+ - mlx-vlm
+ - gemma4_unified_assistant
+ - gemma-4
+ - gemma-4-12b
+ - gemma
+ - mtp
+ - speculative-decoding
+ - draft-model
+ - nvfp4
+ inference: false
  ---

+ # gemma-4-12B-it-qat-assistant-nvfp4
+
+ This repository contains Multi-Token Prediction (MTP) drafter weights split from `google/gemma-4-12B-it-qat-q4_0-unquantized-assistant` for use with `mlx-vlm` speculative decoding.
+
+ This is not a standalone chat or text-generation model. Load it as the draft model alongside a compatible Gemma 4 12B target checkpoint.

+ ## Use with mlx-vlm

  ```bash
+ uv run mlx_vlm.generate \
+ --model google/gemma-4-12B-it \
+ --draft-model mlx-community/gemma-4-12B-it-qat-assistant-nvfp4 \
+ --draft-kind mtp \
+ --prompt "Describe this image." \
+ --max-tokens 256
  ```

+ For local weights:
+
  ```bash
+ uv run mlx_vlm.generate \
+ --model /path/to/target-model \
+ --draft-model /path/to/gemma-4-12B-mtp \
+ --draft-kind mtp \
+ --prompt "Describe this image." \
+ --max-tokens 256
  ```
+
+ ## Model Details
+
+ - Model type: `gemma4_unified_assistant`
+ - Target architecture: Gemma 4 12B
+ - Precision: nvfp4
+ - Runtime: MLX / `mlx-vlm`
+ - Format: Safetensors with MLX-compatible config and tokenizer files
+
+ The stored tensors are nvfp4 MLX-compatible drafter weights.
+
+ ## Intended Use
+
+ Use this repo only as a speculative decoding drafter for compatible Gemma 4 12B checkpoints. The target model verifies drafted tokens, while this MTP model proposes candidate tokens per decoding step.
+
+ ## Limitations
+
+ This checkpoint requires runtime support for Gemma 4 MTP draft models in `mlx-vlm`. Standard standalone generation through generic Transformers APIs is not expected to work with this repository by itself.
+
+ Please refer to the upstream `google/gemma-4-12B-it` model card and license terms for model usage constraints.
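
The "Intended Use" section added in this commit describes a drafter that proposes candidate tokens and a target model that verifies them. The following is a minimal Python sketch of that draft-and-verify loop under greedy decoding. It is an illustration only and not the `mlx-vlm` implementation: the function names (`speculative_step`, `generate`, `toy_target`, `toy_draft`) are hypothetical, the "models" are toy functions over integer tokens, and a real MTP drafter and runtime will differ in detail (for example, the target would score all drafted positions in one batched forward pass).

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens it commits."""
    # 1. The drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks the proposals in order (looped here for clarity).
    committed: List[int] = []
    for token in proposed:
        expected = target(context + committed)
        if token != expected:
            # First mismatch: keep the target's own token and stop.
            committed.append(expected)
            return committed
        committed.append(token)

    # 3. Every proposal matched, so the target contributes one extra token.
    committed.append(target(context + committed))
    return committed


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate up to max_new_tokens tokens by repeating speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + max_new_tokens]


def toy_target(tokens: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Toy drafter: agrees with the target except right after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [3], k=4))
    print(generate(toy_target, toy_draft, [0], max_new_tokens=8))
```

Under greedy verification the committed tokens are always the ones the target would have produced on its own, so the drafter affects only how many tokens are committed per step, not the output. That is why a drafter like this one is described as unusable on its own.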