gemma-4-26B-A4B-it-oQ8

oQ8 MLX conversion of google/gemma-4-26B-A4B-it, converted with oMLX 0.3.11.

Source model: google/gemma-4-26B-A4B-it
Conversion tool: oMLX 0.3.11
Format: MLX safetensors
Quantization: oQ8
Recommended runtime: mlx-vlm >= 0.5.0 or recent oMLX

This is not a fine-tune, merge, or retrained model. It is the source model, quantized and exported for MLX.

Why this conversion

This upload was made after the recent Gemma 4 support work in mlx-vlm 0.5.0 and oMLX 0.3.9 / 0.3.10 / 0.3.11.

The main artifact-level reason is VLM packaging. oMLX 0.3.10 fixed an oQ VLM issue where processor_config.json was not copied into the quantized output, which could make an image-text model load through a text-only path.

This repository was converted with oMLX 0.3.11 after that fix. It includes:

processor_config.json
chat_template.jinja
config.json
generation_config.json
tokenizer.json
tokenizer_config.json
model.safetensors.index.json
6 safetensor shards
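
One way to confirm that a local copy is complete is to check for the files listed above and for every shard named in the index. The sketch below assumes the standard sharded-safetensors index layout, where model.safetensors.index.json holds a "weight_map" object mapping tensor names to shard file names. The function name missing_files and the directory path are illustrative and are not part of oMLX or mlx-vlm.

```python
import json
from pathlib import Path

# Files this repository is stated to include, besides the weight shards.
REQUIRED_FILES = [
    "processor_config.json",
    "chat_template.jinja",
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "model.safetensors.index.json",
]


def missing_files(model_dir):
    """Return a sorted list of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = {name for name in REQUIRED_FILES if not (root / name).is_file()}

    index_path = root / "model.safetensors.index.json"
    if index_path.is_file():
        # Assumption: standard index layout with a "weight_map" of
        # tensor name -> shard file name.
        weight_map = json.loads(index_path.read_text()).get("weight_map", {})
        for shard in set(weight_map.values()):
            if not (root / shard).is_file():
                missing.add(shard)
    return sorted(missing)
```

For a complete copy, missing_files("/path/to/gemma-4-26B-A4B-it-oQ8") returns an empty list. If processor_config.json shows up in the result, the copy may hit the text-only loading problem described above.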

mlx-vlm 0.5.0 is also relevant because it includes Gemma 4 quantized per-layer projection loading and several Gemma 4 VLM/runtime fixes.

Runtime features such as batching, cache reuse, DFlash, MTP, streamed output, and tool-call parsing still depend on the installed runtime version. Downloading this repository does not replace upgrading mlx-vlm or oMLX.
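
Because the runtime version matters independently of the downloaded files, it can help to check the installed mlx-vlm before loading the model. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the comparison looks only at the leading numeric components of the version string, so a pre-release such as 0.5.0rc1 is treated the same as 0.5.0.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Parse the leading numeric release components, e.g. '0.5.0' -> (0, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if installed >= minimum, comparing numeric release components only."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.5' and '0.5.0' compare equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_mlx_vlm(minimum="0.5.0"):
    """Return (installed_version_or_None, ok) for the mlx-vlm distribution."""
    try:
        installed = metadata.version("mlx-vlm")
    except metadata.PackageNotFoundError:
        return None, False
    return installed, meets_minimum(installed, minimum)
```

Calling check_mlx_vlm() returns the installed version string together with a flag saying whether it satisfies the recommended minimum. The first element is None when mlx-vlm is not installed.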

When to use this

Use this artifact if you want an oQ8 Gemma 4 26B A4B MLX conversion made with oMLX 0.3.11, with VLM processor metadata included.

Older conversions can still work if their processor/config files are complete and they are used with a current Gemma 4-aware runtime.

Model notes

Gemma 4 26B A4B is a sparse Mixture-of-Experts model with approximately 25.2B total parameters, about 3.8B active parameters per token, a 256K context window, and text+image input support.
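
The parameter counts above lead to two quick estimates, worked through below as a rough sketch. The weight-size figure assumes a flat 8 bits per weight and ignores quantization scales, any tensors kept at higher precision, and the KV cache, so the real footprint will be somewhat larger. The function names are illustrative.

```python
TOTAL_PARAMS_B = 25.2   # total parameters, in billions (from the model notes)
ACTIVE_PARAMS_B = 3.8   # active parameters per token, in billions


def active_fraction(active_b, total_b):
    """Share of the parameters that take part in computing one token."""
    return active_b / total_b


def weight_size_gb(params_b, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_b * bits_per_weight / 8


print(f"active share: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
# prints "active share: 15.1%"
print(f"weights at 8 bits: {weight_size_gb(TOTAL_PARAMS_B, 8):.1f} GB")
# prints "weights at 8 bits: 25.2 GB"
```

Only about 15% of the parameters are used per token, which reduces compute per token. Memory use is still driven by the total parameter count, because all experts typically have to be resident.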

Audio support belongs to the smaller Gemma 4 variants, not the 26B A4B model.

Usage

pip install -U mlx-vlm "huggingface_hub[hf_xet]"

Text:

mlx_vlm.generate \
  --model QwQbb/gemma-4-26B-A4B-it-oQ8 \
  --max-tokens 512 \
  --temperature 1.0 \
  --prompt "Explain how MoE routing affects inference cost."

Image:

mlx_vlm.generate \
  --model QwQbb/gemma-4-26B-A4B-it-oQ8 \
  --image /path/to/image.png \
  --max-tokens 512 \
  --temperature 1.0 \
  --prompt "Describe this image in detail."
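
For scripted use, the two commands above can be assembled and run from Python. This is a thin wrapper sketch around the same CLI, not the mlx-vlm Python API. It uses only the flags shown in the examples, and the helper names build_generate_command and run_generate are hypothetical.

```python
import subprocess


def build_generate_command(model, prompt, image=None, max_tokens=512, temperature=1.0):
    """Assemble the mlx_vlm.generate argument list used in the examples above."""
    command = [
        "mlx_vlm.generate",
        "--model", model,
        "--max-tokens", str(max_tokens),
        "--temperature", str(temperature),
        "--prompt", prompt,
    ]
    if image is not None:
        # Text-only runs omit --image, as in the first example.
        command += ["--image", str(image)]
    return command


def run_generate(**kwargs):
    """Run the command and return its standard output (requires mlx-vlm installed)."""
    result = subprocess.run(
        build_generate_command(**kwargs), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, run_generate(model="QwQbb/gemma-4-26B-A4B-it-oQ8", prompt="Describe this image in detail.", image="/path/to/image.png") corresponds to the image command above. Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.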

Limitations

Quantized MLX export of the original Google model.
oQ8 prioritizes quality over minimum size.
Quantization can introduce small output differences relative to the source model.
Runtime behavior depends on the installed mlx-vlm / oMLX version.

References

Original model: https://huggingface.co/google/gemma-4-26B-A4B-it
mlx-vlm 0.5.0: https://github.com/Blaizzy/mlx-vlm/releases/tag/v0.5.0
oMLX releases: https://github.com/jundot/omlx/releases
oQ documentation: https://github.com/jundot/omlx/blob/main/docs/oQ_Quantization.md
