Holo-3.1-35B-A3B-oQ8-mtp

8-bit (oMLX oQ8) MLX build of Hcompany/Holo-3.1-35B-A3B — H Company's GUI-grounding / computer-use VLM — with a Multi-Token-Prediction (MTP / nextn) head grafted in so oMLX can run native speculative decoding.

It is symrex/Holo-3.1-35B-A3B-oQ8 (which dropped the MTP head) with the 8-bit MTP module transplanted from tfjack/Qwen3.6-35B-A3B-oQ8-fp16-mtp — the same Qwen3.6-35B-A3B base Holo-3.1 was fine-tuned from.

📦 Code, scripts, and full write-up: https://github.com/gaztrabisme/holo-3.1-a3b-mtp-mlx

Why a graft?

No Holo-3.1 release ships an MTP head — H Company stripped the nextn layer when fine-tuning (verified across bf16, oQ8, GGUF, NVFP4). The MTP head therefore can only come from the Qwen3.6-35B-A3B base. Both models share a byte-identical non-MTP tensor set (2010 tensors, model_type: qwen3_5_moe, hidden 2048, 40 layers, 256 experts, vocab 248320), so the 42-tensor language_model.mtp.* block (already 8-bit, group_size matched) drops in with no transforms. Config gains text_config.mtp_num_hidden_layers=1, mtp_use_dedicated_embeddings=False, and 6 MTP quant-overrides.
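
The graft described above amounts to a key-prefix copy plus a small config patch. The following is a minimal sketch under stated assumptions, not the script from the linked repository: plain dicts stand in for loaded safetensors shards, `graft_mtp` and `patch_config` are hypothetical helper names, placing `mtp_use_dedicated_embeddings` under `text_config` is an assumption, and the 6 MTP quant-overrides are left out because they are not listed here.

```python
MTP_PREFIX = "language_model.mtp."


def graft_mtp(target, donor, prefix=MTP_PREFIX):
    """Return the target's tensors plus the donor's MTP block, copied as-is."""
    donor_mtp = {k: v for k, v in donor.items() if k.startswith(prefix)}
    donor_rest = {k for k in donor if not k.startswith(prefix)}
    if not donor_mtp:
        raise ValueError("donor has no MTP tensors")
    if any(k.startswith(prefix) for k in target):
        raise ValueError("target already has an MTP block")
    # Only tensor names are compared here; shapes and quantization
    # parameters would need the same check on real checkpoints.
    if donor_rest != set(target):
        raise ValueError("non-MTP tensor names differ")
    merged = dict(target)      # every non-MTP tensor stays the target's own
    merged.update(donor_mtp)   # MTP block comes from the donor, no transforms
    return merged


def patch_config(config):
    """Return a copy of config with the MTP fields set (assumed location)."""
    patched = dict(config)
    text_cfg = dict(patched.get("text_config", {}))
    text_cfg["mtp_num_hidden_layers"] = 1
    text_cfg["mtp_use_dedicated_embeddings"] = False
    patched["text_config"] = text_cfg
    return patched
```

The name check is the important part: the drop-in only works because both checkpoints expose the same non-MTP tensor set.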

Verified on M3 Max (96 GB), single-stream, greedy

| | base oQ8 | this (oQ8-mtp) |
|---|---|---|
| Decode (300 tok) | 55.0 tok/s | 60.5 tok/s (~1.1×) |
| MTP acceptance (structured text) | | 94.8% (145/153) |
| Output vs base | | byte-identical (lossless) |

Lossless by construction: speculative decoding verifies every drafted token against the main model, so output is exactly Holo's greedy output — confirmed byte-identical on text (sha1 match) and on all 12 grounding targets below. The grafted head only affects speed, never correctness.
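
To see why greedy speculative decoding cannot change the output, here is a toy sketch. It is not oMLX's implementation: `speculative_greedy` and `greedy_decode` are hypothetical names, the "models" are plain functions from a token context to the next token, and verification is done one token at a time, whereas a real engine would verify the drafted tokens in a single batched forward pass.

```python
def greedy_decode(target, prompt, max_new):
    """Reference: plain greedy decoding with the main model only."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_greedy(target, draft, prompt, max_new, k=1):
    """Draft up to k tokens, then keep only what the target agrees with."""
    out = list(prompt)
    accepted = drafted = 0
    limit = len(prompt) + max_new
    while len(out) < limit:
        # Draft phase: the cheap head proposes tokens on its own context.
        ctx = list(out)
        proposal = []
        for _ in range(min(k, limit - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        drafted += len(proposal)
        # Verify phase: the emitted token is always the target's choice.
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if tok != expected:
                break          # first mismatch discards the rest of the draft
            accepted += 1
    return out[len(prompt):], accepted, drafted
```

Because every appended token is the target's own greedy choice, a poor draft head only lowers `accepted / drafted`; it cannot alter the text.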

Grounding spot-check (synthetic 1280×720 UI, 12 known targets)

12/12 hit, median 6 px error, max 55 px. Base and MTP returned identical coordinates on all 12. (MTP acceptance on grounding is lower, 33–60%, because outputs are tiny ~15-token {x,y} blobs; for pure grounding, screenshot prefill dominates latency and MTP barely moves it — it pays off on longer agentic/navigation traces.)

Green rings = ground truth, red crosshairs = Holo's predicted click (greedy, temperature=0):

[Figure: grounding spot-check, ground truth vs Holo predicted clicks, 12/12 hit]

Usage (oMLX, grounding)

Holo-3.1 grounding = single user turn, normalized [0,1000] JSON {x,y} (not Holo1's pixel Click(x,y)). Use greedy + thinking off; greedy is both the correct setting for stable coordinates and what maximizes MTP acceptance.

import json

from openai import OpenAI

# Point the OpenAI client at the local oMLX server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="...")

W, H = 1280, 720  # send a smart_resize'd image; scale coords against THESE dims
prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    " * You must output a valid JSON following the format: "
    '{"properties":{"x":{"type":"integer"},"y":{"type":"integer"}},"required":["x","y"]}\n'
    " Your target is:\nthe 'Save Changes' button")

# Single user turn: screenshot first, then the localization prompt.
r = client.chat.completions.create(
    model="Holo-3.1-35B-A3B-oQ8-mtp", temperature=0,  # greedy
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": prompt}]}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},  # thinking off
        "response_format": {"type": "json_schema", "json_schema": {"name": "point",
            "schema": {"type": "object", "required": ["x", "y"], "additionalProperties": False,
                       "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}}}}},
    })

# The model answers in normalized [0, 1000] coordinates; map them to pixels.
p = json.loads(r.choices[0].message.content)
px, py = int(p["x"] / 1000 * W), int(p["y"] / 1000 * H)  # -> pixel click point

smart_resize the screenshot first (Qwen factor = patch 16 × merge 2 = 32; min 65 536 / max 16.7 M px) so the server's internal resize doesn't shift coordinates, and scale against the dimensions you actually send.
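
A minimal sketch of that resize step, modeled on the rounding logic of Qwen's published `smart_resize` helper with the constants quoted above. Assumptions: "16.7 M px" is taken as 16,777,216 (4096 × 4096), the aspect-ratio guard of the upstream helper is omitted, and `to_pixels` is a hypothetical convenience function.

```python
import math

FACTOR = 32               # patch 16 x merge 2
MIN_PIXELS = 65_536
MAX_PIXELS = 16_777_216   # assumption: "16.7 M px" read as 4096 * 4096


def smart_resize(height, width, factor=FACTOR,
                 min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (height, width) snapped to multiples of factor within the pixel budget."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


def to_pixels(x_norm, y_norm, width, height):
    """Map normalized [0, 1000] model output to pixels of the image that was sent."""
    return int(x_norm / 1000 * width), int(y_norm / 1000 * height)
```

Resize the screenshot to the returned size before encoding it, and pass that same width and height to `to_pixels`, so the coordinates refer to the image the model actually saw.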

Sampling — two configs

This is a two-mode model; sampling should match the mode (measured on the 12-target grounding fixture):

| Profile | Grounding accuracy | Use for |
|---|---|---|
| greedy (temperature=0, no penalties) | 12/12 hit, median 3 px | grounding / localization |
| instruct (temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5) | 11/12, median 27 px, jitters ±40 px | agentic / navigation / general non-thinking |

  • Grounding → temperature=0, no penalties. Deterministic, precise coordinates; this is also H Company's official localization setting. presence_penalty on a 2-integer {x,y} actively skews coordinates, so leave penalties off. MTP acceptance is lower here (~33–60%), but that hardly matters: outputs are ~15 tokens and prefill dominates, so the lower acceptance costs essentially no wall-clock time.
  • Agentic / navigation / general (non-thinking) → instruct preset: temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5, repetition_penalty=1.0 (Qwen3 anti-loop config). This is where MTP pays off (long traces) and where acceptance climbs >90% — temperature softens the rejection-sampling threshold so near-miss drafts are accepted.

generation_config.json ships the instruct sampling as the default (temp 0.7 / top_p 0.8 / top_k 20 / min_p 0); pass temperature=0 explicitly for grounding. presence_penalty is a serving/API param (not in generation_config.json) — add it only in agentic mode.
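
One way to keep the two profiles from leaking into each other is to build the request kwargs from a single table. This is a sketch with assumptions: `sampling_kwargs` is a hypothetical helper, and `top_k`, `min_p` and `repetition_penalty` are routed through `extra_body` because they are not named parameters of the OpenAI Python client; whether the server reads them from there should be checked against your deployment.

```python
import copy

_NO_THINKING = {"chat_template_kwargs": {"enable_thinking": False}}

_PROFILES = {
    # Grounding: greedy, no penalties of any kind.
    "grounding": {
        "temperature": 0,
        "extra_body": dict(_NO_THINKING),
    },
    # Agentic / navigation / general non-thinking: the instruct preset.
    "agentic": {
        "temperature": 0.7,
        "top_p": 0.8,
        "presence_penalty": 1.5,
        "extra_body": {
            **_NO_THINKING,
            "top_k": 20,               # assumption: server accepts these
            "min_p": 0,                # fields via extra_body
            "repetition_penalty": 1.0,
        },
    },
}


def sampling_kwargs(mode):
    """Return a fresh kwargs dict for client.chat.completions.create(**kwargs)."""
    if mode not in _PROFILES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(_PROFILES)}")
    return copy.deepcopy(_PROFILES[mode])
```

Usage would look like `client.chat.completions.create(model=..., messages=..., **sampling_kwargs("grounding"))`; for grounding, add the `response_format` schema to the returned `extra_body` before sending.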

Notes

  • MTP is auto-detected by oMLX from mtp_num_hidden_layers; the log prints MTP path activated ... accept=N/M.
  • 36 GB resident. Two 38 GB models won't co-reside under oMLX's ~70 GB prefill guard — its LRU auto-evicts the idle one.
  • Use the served model for both grounding and agentic/navigation modes; MTP helps the latter more.
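
To track acceptance across runs, the `accept=N/M` field can be pulled out of the server log. A small sketch: `acceptance_rate` is a hypothetical helper, and since the rest of the log line is elided above, only the `accept=N/M` fragment is matched.

```python
import re

_ACCEPT = re.compile(r"accept=(\d+)/(\d+)")


def acceptance_rate(log_line):
    """Return accepted/drafted as a float, or None if the line has no usable accept field."""
    match = _ACCEPT.search(log_line)
    if match is None:
        return None
    accepted, drafted = int(match.group(1)), int(match.group(2))
    if drafted == 0:
        return None  # nothing was drafted, so the rate is undefined
    return accepted / drafted
```

For the structured-text run reported above, `accept=145/153` corresponds to the 94.8% figure in the table.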

Attribution

Apache-2.0. Built on the work of H Company (Holo-3.1), Qwen (Qwen3.6-35B-A3B base + MTP head), symrex (oQ8 quant), and tfjack (Qwen3.6 oQ8 MTP MLX conversion). Quantization tooling: oMLX / oQ.
