Holo-3.1-35B-A3B-oQ8-mtp
8-bit (oMLX oQ8) MLX build of Hcompany/Holo-3.1-35B-A3B — H Company's GUI-grounding / computer-use VLM — with a Multi-Token-Prediction (MTP / nextn) head grafted in so oMLX can run native speculative decoding.
It is symrex/Holo-3.1-35B-A3B-oQ8 (which dropped the MTP head) with the 8-bit MTP module transplanted from tfjack/Qwen3.6-35B-A3B-oQ8-fp16-mtp — the same Qwen3.6-35B-A3B base Holo-3.1 was fine-tuned from.
📦 Code, scripts, and full write-up: https://github.com/gaztrabisme/holo-3.1-a3b-mtp-mlx
Why a graft?
No Holo-3.1 release ships an MTP head — H Company stripped the nextn layer when fine-tuning (verified across bf16, oQ8, GGUF, NVFP4). The MTP head therefore can only come from the Qwen3.6-35B-A3B base. Both models share a byte-identical non-MTP tensor set (2010 tensors, model_type: qwen3_5_moe, hidden 2048, 40 layers, 256 experts, vocab 248320), so the 42-tensor language_model.mtp.* block (already 8-bit, group_size matched) drops in with no transforms. Config gains text_config.mtp_num_hidden_layers=1, mtp_use_dedicated_embeddings=False, and 6 MTP quant-overrides.
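The compatibility check behind the graft can be illustrated with a minimal sketch. It is not the script from the linked repository: `mtp_tensors_to_graft` is a hypothetical helper, and it assumes each checkpoint has already been summarized as a mapping from tensor name to shape. It only confirms that everything outside `language_model.mtp.*` lines up and then lists the donor tensors to copy.

```python
MTP_PREFIX = "language_model.mtp."


def mtp_tensors_to_graft(target_shapes, donor_shapes, prefix=MTP_PREFIX):
    """Return the donor's MTP tensor names if the two checkpoints are compatible.

    Both arguments map tensor name -> shape tuple (a hypothetical manifest format).
    """
    target_body = {k: v for k, v in target_shapes.items() if not k.startswith(prefix)}
    donor_body = {k: v for k, v in donor_shapes.items() if not k.startswith(prefix)}

    # Every non-MTP tensor must exist in both checkpoints with the same shape.
    if target_body != donor_body:
        mismatched = sorted(
            k for k in set(target_body) | set(donor_body)
            if target_body.get(k) != donor_body.get(k)
        )
        raise ValueError(f"non-MTP tensors differ: {mismatched[:5]}")

    mtp_names = sorted(k for k in donor_shapes if k.startswith(prefix))
    if not mtp_names:
        raise ValueError("donor checkpoint has no MTP tensors to graft")
    return mtp_names


# Toy manifests standing in for the real 2010 + 42 tensor checkpoints.
holo = {"language_model.layers.0.w": (2048, 2048)}
qwen = {
    "language_model.layers.0.w": (2048, 2048),
    "language_model.mtp.fc.weight": (2048, 4096),
}
print(mtp_tensors_to_graft(holo, qwen))
```

On the real checkpoints the returned list would be expected to hold the 42 MTP tensors mentioned above; the config changes are a separate step that this sketch does not cover.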
Verified on M3 Max (96 GB), single-stream, greedy
| Metric | base oQ8 | this (oQ8-mtp) |
|---|---|---|
| Decode (300 tok) | 55.0 tok/s | 60.5 tok/s (~1.1×) |
| MTP acceptance (structured text) | — | 94.8% (145/153) |
| Output vs base | — | byte-identical (lossless) |
Lossless by construction: speculative decoding verifies every drafted token against the main model, so output is exactly Holo's greedy output — confirmed byte-identical on text (sha1 match) and on all 12 grounding targets below. The grafted head only affects speed, never correctness.
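Why greedy speculative decoding cannot change the output can be shown with a toy sketch. This is not oMLX's implementation: `toy_target` and `toy_draft` are made-up stand-ins for the main model and the MTP head, and a real engine verifies all drafted tokens in a single forward pass rather than one call per token.

```python
def greedy_decode(target_next, prompt, n):
    """Plain greedy decoding with the main model only."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(target_next, draft_next, prompt, n, k=1):
    """Draft k tokens, keep the prefix the main model agrees with."""
    seq = list(prompt)
    accepted = drafted = 0
    while len(seq) - len(prompt) < n:
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        drafted += len(draft)
        for tok in draft:
            expected = target_next(seq)  # the main model's own greedy choice
            if tok != expected:
                seq.append(expected)     # reject the draft, emit the main model's token
                break
            seq.append(tok)
            accepted += 1
        else:
            seq.append(target_next(seq))  # all drafts accepted: one bonus token
    return seq[len(prompt):len(prompt) + n], accepted, drafted


def toy_target(seq):
    # Hypothetical deterministic "main model".
    return (sum(seq) * 31 + 7) % 50


def toy_draft(seq):
    # Hypothetical draft head: agrees with the target except at every fifth position.
    t = toy_target(seq)
    return t if len(seq) % 5 else (t + 1) % 50


prompt = [1, 2, 3]
reference = greedy_decode(toy_target, prompt, 40)
output, accepted, drafted = speculative_decode(toy_target, toy_draft, prompt, 40)
print(output == reference, f"accept={accepted}/{drafted}")
```

Every emitted token is either a draft that equals the main model's greedy choice or that choice itself, so a weaker draft head lowers the acceptance count but never alters the text.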
Grounding spot-check (synthetic 1280×720 UI, 12 known targets)
12/12 hit, median 6 px error, max 55 px. Base and MTP returned identical coordinates on all 12. (MTP acceptance on grounding is lower, 33–60%, because outputs are tiny ~15-token {x,y} blobs; for pure grounding, screenshot prefill dominates latency and MTP barely moves it — it pays off on longer agentic/navigation traces.)
Green rings = ground truth, red crosshairs = Holo's predicted click (greedy, temperature=0):
Usage (oMLX, grounding)
A Holo-3.1 grounding request is a single user turn, and the model answers with a JSON object {x,y} in normalized [0,1000] coordinates (not Holo1's pixel Click(x,y)). Use greedy decoding with thinking off; greedy is both the correct setting for stable coordinates and what maximizes MTP acceptance.
import json

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="...")
W, H = 1280, 720  # send a smart_resize'd image; scale coords against THESE dims

prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    " * You must output a valid JSON following the format: "
    '{"properties":{"x":{"type":"integer"},"y":{"type":"integer"}},"required":["x","y"]}\n'
    " Your target is:\nthe 'Save Changes' button")

r = client.chat.completions.create(
    model="Holo-3.1-35B-A3B-oQ8-mtp", temperature=0,
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": prompt}]}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},
        "response_format": {"type": "json_schema", "json_schema": {"name": "point",
            "schema": {"type": "object", "required": ["x", "y"], "additionalProperties": False,
                       "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}}}}},
    })

p = json.loads(r.choices[0].message.content)
px, py = int(p["x"]/1000*W), int(p["y"]/1000*H)  # -> pixel click point
Run smart_resize on the screenshot before sending it (Qwen resize factor = patch 16 × merge 2 = 32; min 65 536 px, max 16.7 M px) so that the server's internal resize does not shift coordinates, then scale the returned coordinates against the dimensions you actually sent.
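The resize rule can be sketched as follows, modeled on the Qwen-VL `smart_resize` logic. The constants are assumptions taken from the figures above (factor 32, minimum 65 536 px, and "16.7 M" read as 2**24), and `to_pixels` is a hypothetical helper name; the server's actual preprocessing may differ.

```python
import math

FACTOR = 32              # patch 16 x merge 2
MIN_PIXELS = 65_536
MAX_PIXELS = 16_777_216  # assumption: "16.7 M px" interpreted as 2**24


def smart_resize(height, width, factor=FACTOR,
                 min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (height, width) snapped to multiples of `factor` within the pixel budget."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / beta / factor) * factor)
        w = max(factor, math.floor(width / beta / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w


def to_pixels(x, y, sent_width, sent_height):
    """Map normalized [0,1000] model coordinates onto the image that was sent."""
    return int(x / 1000 * sent_width), int(y / 1000 * sent_height)


h, w = smart_resize(720, 1280)
print((w, h), to_pixels(500, 500, w, h))
```

Under these assumptions a 1280×720 screenshot snaps to 1280×704, because 720 is not a multiple of 32; the click point should then be scaled against 1280×704 rather than the original size.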
Sampling — two configs
This is a two-mode model; sampling should match the mode (measured on the 12-target grounding fixture):
| Profile | Grounding accuracy | Use for |
|---|---|---|
| greedy (temperature=0, no penalties) | 12/12 hit, median 3 px | grounding / localization |
| instruct (temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5) | 11/12, median 27 px, jitters ±40 px | agentic / navigation / general non-thinking |
- Grounding → temperature=0, no penalties. Deterministic, precise coordinates; this is also H Company's official localization setting. presence_penalty on a 2-integer {x,y} actively skews coordinates, so leave penalties off. MTP acceptance is lower here (~33–60%) but irrelevant: outputs are ~15 tokens, prefill dominates, so greedy costs no wall-clock.
- Agentic / navigation / general (non-thinking) → instruct preset: temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5, repetition_penalty=1.0 (Qwen3 anti-loop config). This is where MTP pays off (long traces) and where acceptance climbs >90%: temperature softens the rejection-sampling threshold so near-miss drafts are accepted.
generation_config.json ships the instruct sampling as the default (temp 0.7 / top_p 0.8 / top_k 20 / min_p 0); pass temperature=0 explicitly for grounding. presence_penalty is a serving/API param (not in generation_config.json) — add it only in agentic mode.
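One way to keep the two profiles from leaking into each other is a small lookup that returns request keyword arguments per mode. This is a sketch: `sampling_kwargs` is a hypothetical helper, and passing `top_k`, `min_p`, and `repetition_penalty` through `extra_body` assumes the server accepts them as extra request fields, which should be checked against the serving stack in use.

```python
import copy

PROFILES = {
    # Grounding: greedy, no penalties.
    "grounding": {"temperature": 0.0, "extra_body": {}},
    # Agentic / navigation / general non-thinking: the instruct preset.
    "agentic": {
        "temperature": 0.7,
        "top_p": 0.8,
        "presence_penalty": 1.5,
        # Not standard OpenAI parameters; assumed to be accepted via extra_body.
        "extra_body": {"top_k": 20, "min_p": 0.0, "repetition_penalty": 1.0},
    },
}


def sampling_kwargs(mode):
    """Return keyword arguments for client.chat.completions.create(...)."""
    if mode not in PROFILES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(PROFILES)}")
    kwargs = copy.deepcopy(PROFILES[mode])
    # Both profiles run with thinking off.
    kwargs["extra_body"]["chat_template_kwargs"] = {"enable_thinking": False}
    return kwargs


print(sampling_kwargs("grounding"))
```

A call would then look like `client.chat.completions.create(model=..., messages=..., **sampling_kwargs("agentic"))`, with any `response_format` merged into `extra_body` by the caller.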
Notes
- MTP is auto-detected by oMLX from mtp_num_hidden_layers; the log prints MTP path activated ... accept=N/M.
- 36 GB resident. Two 38 GB models won't co-reside under oMLX's ~70 GB prefill guard; its LRU auto-evicts the idle one.
- Use the served model for both grounding and agentic/navigation modes; MTP helps the latter more.
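To turn the `accept=N/M` log entries into an acceptance rate, a short parser is enough. This is a sketch: only the `accept=N/M` fragment is taken from the note above, the rest of the log line is unknown, and summing across lines assumes each line reports one request rather than a running total.

```python
import re

ACCEPT_RE = re.compile(r"accept=(\d+)/(\d+)")


def acceptance_rate(log_lines):
    """Return accepted/drafted over all matching lines, or None if nothing matched."""
    accepted = drafted = 0
    for line in log_lines:
        match = ACCEPT_RE.search(line)
        if match:
            accepted += int(match.group(1))
            drafted += int(match.group(2))
    return accepted / drafted if drafted else None


# Example line built from the figures in the benchmark table (145 of 153 drafts accepted).
rate = acceptance_rate(["MTP path activated ... accept=145/153"])
print(f"{rate:.1%}")
```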
Attribution
Apache-2.0. Built on the work of H Company (Holo-3.1), Qwen (Qwen3.6-35B-A3B base + MTP head), symrex (oQ8 quant), and tfjack (Qwen3.6 oQ8 MTP MLX conversion). Quantization tooling: oMLX / oQ.
