diffusiongemma-26B-A4B-it-mini-g32 — for 16 GB Macs
9.79 GB TurboQuant build of google/diffusiongemma-26B-A4B-it, sized to run on a 16 GB Apple Silicon Mac (Mac mini class), produced with TurboQuant-MLX. Sibling of the higher-fidelity tq3-g32 build (13.8 GB — needs ~18 GB peak, OOMs on 16 GB machines).
Precision layout
| Component | Precision |
|---|---|
| MoE experts, layers 0,1,2,27,28,29 (protected) | tq3, group 32 |
| MoE experts, layers 3–26 | tq2, group 32 |
| Attention q/k/v/o | tq3, group 32 |
| Embeddings, dense per-layer MLP, vision tower | 8-bit affine (g64) |
| Routers, self-conditioning, norms | bf16 |
Why the protection: raw 2-bit experts break arithmetic on this model (17×23 → "3"). Keeping the first/last three layers' experts at 3-bit restores it for +0.2 GB — measured: 17×23 = 391 ✓, multi-step chains correct, in-context recall exact, prose with only minor artifacts.
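The expert layout above can be written down as a small lookup. This is only an illustrative sketch, not TurboQuant-MLX code: the function names are hypothetical, and the total of 30 layers is an assumption inferred from the layer indices 0 to 29 in the table. The average it computes ignores per-group scale overhead.

```python
# Illustrative sketch of the expert precision layout described above.
# Assumption: 30 decoder layers (indices 0-29), inferred from the table.
NUM_LAYERS = 30
PROTECTED_LAYERS = {0, 1, 2, 27, 28, 29}  # first and last three layers


def expert_bits(layer: int) -> int:
    """Return the MoE expert bit-width for a given layer index."""
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer {layer} is outside 0..{NUM_LAYERS - 1}")
    return 3 if layer in PROTECTED_LAYERS else 2


def mean_expert_bits() -> float:
    """Average expert bit-width across layers (excludes group-scale overhead)."""
    return sum(expert_bits(i) for i in range(NUM_LAYERS)) / NUM_LAYERS


if __name__ == "__main__":
    for i in range(NUM_LAYERS):
        print(f"layer {i:2d}: tq{expert_bits(i)}")
    print(f"mean expert bits: {mean_expert_bits():.2f}")
```

Six protected layers at 3 bits and 24 layers at 2 bits average out to 2.2 bits per expert weight, which is why the protection costs little extra space.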
Measured (Apple Silicon)
| | this repo (mini) | tq3-g32 | mlx-community 4-bit |
|---|---|---|---|
| Size on disk | 9.79 GB | 13.8 GB | 15.0 GB |
| Peak memory (--max-tokens 120) | ~12.4 GB | ~18 GB | ~19 GB |
| Math / recall probes | pass | pass | pass |
Verified on a 16 GB Mac mini (after the wired-limit bump below):
96 tokens at ~1.5 tok/s, 12.9 GB peak, coherent output. Speed knob:
--max-denoising-steps 24 is ~2x faster at a mild quality cost.
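
To put those numbers in wall-clock terms, here is a rough back-of-the-envelope sketch. The helper name is hypothetical, and it assumes the reported tok/s is an end-to-end average and that the "~2x" figure applies uniformly, which the measurements above do not guarantee.

```python
def decode_seconds(tokens: int, tok_per_s: float, speedup: float = 1.0) -> float:
    """Estimate generation time from a token count and an average throughput."""
    if tok_per_s <= 0 or speedup <= 0:
        raise ValueError("throughput and speedup must be positive")
    return tokens / (tok_per_s * speedup)


if __name__ == "__main__":
    # Figures taken from the measurement above: 96 tokens at ~1.5 tok/s.
    print(f"default steps: ~{decode_seconds(96, 1.5):.0f} s")
    # Assumed uniform 2x speedup with --max-denoising-steps 24.
    print(f"24 steps:      ~{decode_seconds(96, 1.5, speedup=2.0):.0f} s")
```

So a 96-token reply takes roughly a minute at the default setting and about half that with the reduced step count.
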
On a 16 GB Mac the default Metal working-set limit is 12.1 GB and the mini
build peaks just above it (12.4 GB at --max-tokens 120), so raise the
Metal wired limit first (verified working on a 16 GB Mac mini; resets on
reboot):

    sudo sysctl iogpu.wired_limit_mb=13824

Alternatively keep the canvas (and its activations) smaller with
--max-tokens 96.
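
The headroom reasoning can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the helper names are hypothetical, the peak figures are the approximate values from the table, and the GB figures in the text are treated as using the same 1024-based units as the sysctl value.

```python
# Approximate peak memory per build, from the "Measured" table (GB).
PEAK_GB = {"mini": 12.4, "tq3-g32": 18.0, "mlx-community 4-bit": 19.0}

DEFAULT_LIMIT_GB = 12.1   # default Metal working-set limit on a 16 GB Mac
RAISED_LIMIT_MB = 13824   # value passed to iogpu.wired_limit_mb


def mb_to_gb(mb: int) -> float:
    """Convert MB to GB, assuming 1024 MB per GB."""
    return mb / 1024


def fits(peak_gb: float, limit_gb: float) -> bool:
    """True if the measured peak stays within the given limit."""
    return peak_gb <= limit_gb


if __name__ == "__main__":
    raised = mb_to_gb(RAISED_LIMIT_MB)
    for name, peak in PEAK_GB.items():
        print(f"{name}: default={fits(peak, DEFAULT_LIMIT_GB)}, "
              f"raised ({raised} GB)={fits(peak, raised)}")
```

Under these assumptions the raised limit works out to 13.5 GB, which covers the mini build's peak but not the two larger builds.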
Requirements

    pip install "turboquant-mlx-full[vlm]>=0.8.0"

Quick Start

    python -m turboquant_mlx.generate_vlm \
      --model manjunathshiva/diffusiongemma-26B-A4B-it-mini-g32 \
      --prompt "Write a short paragraph about the ocean." \
      --max-tokens 120 --temp 0.0

Optional speed knob: --max-denoising-steps 24 (~2x faster, mild quality
cost — quantized diffusion needs more denoising iterations than bf16).
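
If you want to call the CLI from a script, one option is to assemble the argument list in Python. This is a minimal sketch: `build_generate_cmd` is a hypothetical helper, it uses only the flags shown in this section, and it does not rely on any Python API of `turboquant_mlx`.

```python
import shlex

MODEL = "manjunathshiva/diffusiongemma-26B-A4B-it-mini-g32"


def build_generate_cmd(prompt, max_tokens=120, temp=0.0, max_denoising_steps=None):
    """Build the argv list for the Quick Start command shown above."""
    cmd = [
        "python", "-m", "turboquant_mlx.generate_vlm",
        "--model", MODEL,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--temp", str(temp),
    ]
    # Optional speed knob; omitted unless explicitly requested.
    if max_denoising_steps is not None:
        cmd += ["--max-denoising-steps", str(max_denoising_steps)]
    return cmd


if __name__ == "__main__":
    cmd = build_generate_cmd("Write a short paragraph about the ocean.",
                             max_tokens=96, max_denoising_steps=24)
    # Print a copy-pasteable shell command; pass `cmd` to
    # subprocess.run(cmd, check=True) to actually launch generation.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain quotes or spaces.
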
Reproducing the conversion

    python -m turboquant_mlx.convert_vlm \
      --hf-path google/diffusiongemma-26B-A4B-it \
      --mlx-path ./diffusiongemma-26B-A4B-it-mini-g32 \
      --bits 2 --attn-bits 3 -g 32 \
      --protect-expert-layers 0,1,2,27,28,29 --protect-bits 3 \
      --quantize-extras

License
Apache 2.0, subject to the Gemma license terms (same as the base model). Quantization tooling: TurboQuant-MLX.
Copyright 2026 Manjunath Janardhan.
Citation

    @article{zandieh2025turboquant,
      title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
      author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
      year={2025},
      eprint={2504.19874},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.19874}
    }
