How to use from the MLX library
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("QwQbb/gemma-4-26B-A4B-it-oQ8")
config = load_config("QwQbb/gemma-4-26B-A4B-it-oQ8")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)

gemma-4-26B-A4B-it-oQ8

oQ8 MLX conversion of google/gemma-4-26B-A4B-it, converted with oMLX 0.3.11.

  • Source model: google/gemma-4-26B-A4B-it
  • Conversion tool: oMLX 0.3.11
  • Format: MLX safetensors
  • Quantization: oQ8
  • Recommended runtime: mlx-vlm >= 0.5.0 or recent oMLX
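
The recommended runtime above can be checked before loading the model. The following is a minimal sketch, not part of mlx-vlm or oMLX: the helper names `parse_version`, `meets_minimum`, and `check_mlx_vlm` are hypothetical, and the simple numeric comparison assumes plain `X.Y.Z` version strings (pre-release suffixes are ignored).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.5.0' into (0, 5, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad to three components so '0.5' compares equal to '0.5.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum="0.5.0"):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_mlx_vlm(minimum="0.5.0"):
    """Describe whether the installed mlx-vlm satisfies the minimum version."""
    try:
        installed = version("mlx-vlm")
    except PackageNotFoundError:
        return f"mlx-vlm is not installed; need >= {minimum}"
    if meets_minimum(installed, minimum):
        return f"mlx-vlm {installed} is new enough"
    return f"mlx-vlm {installed} is older than {minimum}; upgrade recommended"


if __name__ == "__main__":
    print(check_mlx_vlm())
```

If the check reports an older version, upgrading with pip is the usual remedy.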

This is not a fine-tune, merge, or retrained model. The source model weights are only quantized and exported for MLX.

Why this conversion

This upload was made after the recent Gemma 4 support work in mlx-vlm 0.5.0 and oMLX 0.3.9 / 0.3.10 / 0.3.11.

The main artifact-level reason is VLM packaging. oMLX 0.3.10 fixed an oQ VLM issue where processor_config.json was not copied into the quantized output, which could make an image-text model load through a text-only path.

This repository was converted with oMLX 0.3.11 after that fix. It includes:

  • processor_config.json
  • chat_template.jinja
  • config.json
  • generation_config.json
  • tokenizer.json
  • tokenizer_config.json
  • model.safetensors.index.json
  • 6 safetensors shards
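
A local download can be checked against the file list above. This is a rough sketch under stated assumptions: `check_artifact` is a hypothetical helper, the directory path is whatever location the repository was downloaded to, and shards are assumed to be the files ending in `.safetensors` at the top level of that directory.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "processor_config.json",
    "chat_template.jinja",
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "model.safetensors.index.json",
]
EXPECTED_SHARDS = 6


def check_artifact(model_dir):
    """Return (missing file names, number of *.safetensors shards found)."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # The index file ends in .json, so it is not counted as a shard here.
    shard_count = len(list(root.glob("*.safetensors")))
    return missing, shard_count


if __name__ == "__main__":
    import sys

    missing, shards = check_artifact(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("missing files:", missing or "none")
    print(f"shards found: {shards} (expected {EXPECTED_SHARDS})")
```

A missing `processor_config.json` is the case to watch for, since that is the file whose absence could route an image-text model through a text-only path.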

mlx-vlm 0.5.0 is also relevant because it includes Gemma 4 quantized per-layer projection loading and several Gemma 4 VLM/runtime fixes.

Runtime features such as batching, cache reuse, DFlash, MTP, streamed output, and tool-call parsing still depend on the installed runtime version. Downloading this repository does not replace upgrading mlx-vlm or oMLX.

When to use this

Use this artifact if you want an oQ8 Gemma 4 26B A4B MLX conversion made with oMLX 0.3.11, with VLM processor metadata included.

Older conversions can still work if their processor/config files are complete and they are used with a current Gemma 4-aware runtime.

Model notes

Gemma 4 26B A4B is a sparse Mixture-of-Experts model with approximately 25.2B total parameters, about 3.8B active parameters per token, a 256K context window, and text+image input support.
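
The parameter counts above translate into two back-of-the-envelope figures. This is a rough sketch only: it assumes about 8 bits per weight for oQ8 and ignores quantization scales, any tensors kept at higher precision, and the KV cache, so real memory use will differ.

```python
TOTAL_PARAMS = 25.2e9   # approximate total parameters (from the model notes)
ACTIVE_PARAMS = 3.8e9   # approximate active parameters per token
BITS_PER_WEIGHT = 8     # assumption: roughly 8 bits per weight for oQ8

# Share of the weights that participate in computing each token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# All experts must still be resident in memory, so size follows the total.
weight_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9

print(f"active fraction per token: {active_fraction:.1%}")  # 15.1%
print(f"approx. weight size: {weight_gb:.1f} GB")           # 25.2 GB
```

The sparse routing lowers per-token compute, but not the memory needed to hold the weights.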

Audio support belongs to the smaller Gemma 4 variants, not the 26B A4B model.

Usage

pip install -U mlx-vlm "huggingface_hub[hf_xet]"

Text:

mlx_vlm.generate \
  --model QwQbb/gemma-4-26B-A4B-it-oQ8 \
  --max-tokens 512 \
  --temperature 1.0 \
  --prompt "Explain how MoE routing affects inference cost."

Image:

mlx_vlm.generate \
  --model QwQbb/gemma-4-26B-A4B-it-oQ8 \
  --image /path/to/image.png \
  --max-tokens 512 \
  --temperature 1.0 \
  --prompt "Describe this image in detail."
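
For scripted or repeated runs, the two commands above can be assembled from Python. This is a minimal sketch: `build_generate_command` and `run_generate` are hypothetical helpers, and they use only the flags shown above. They assume the `mlx_vlm.generate` command is on the PATH.

```python
import shlex
import subprocess

MODEL = "QwQbb/gemma-4-26B-A4B-it-oQ8"


def build_generate_command(prompt, image=None, model=MODEL,
                           max_tokens=512, temperature=1.0):
    """Build the argument list for mlx_vlm.generate; image is optional."""
    cmd = [
        "mlx_vlm.generate",
        "--model", model,
        "--max-tokens", str(max_tokens),
        "--temperature", str(temperature),
    ]
    if image is not None:
        cmd += ["--image", image]
    cmd += ["--prompt", prompt]
    return cmd


def run_generate(prompt, **kwargs):
    """Run the command and raise if it exits with a non-zero status."""
    return subprocess.run(build_generate_command(prompt, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_generate_command("Describe this image in detail.",
                                            image="/path/to/image.png")))
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.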

Limitations

  • Quantized MLX export of the original Google model.
  • oQ8 prioritizes quality over minimum size.
  • Quantization can introduce small differences from the source model.
  • Runtime behavior depends on the installed mlx-vlm / oMLX version.
