---
license: apache-2.0
library_name: mlx
pipeline_tag: image-text-to-text
base_model:
- Hcompany/Holo-3.1-35B-A3B
- symrex/Holo-3.1-35B-A3B-oQ8
- Qwen/Qwen3.6-35B-A3B
tags:
- mlx
- omlx
- oq
- quantized
- mtp
- speculative-decoding
- grounding
- computer-use
- gui
- qwen3_5_moe
---

# Holo-3.1-35B-A3B-oQ8-mtp

8-bit (oMLX `oQ8`) MLX build of **[Hcompany/Holo-3.1-35B-A3B](https://huggingface.co/Hcompany/Holo-3.1-35B-A3B)**, H Company's GUI-grounding / computer-use VLM, **with a Multi-Token-Prediction (MTP / nextn) head grafted in** so [oMLX](https://github.com/jundot/omlx) can run native speculative decoding.

It is [`symrex/Holo-3.1-35B-A3B-oQ8`](https://huggingface.co/symrex/Holo-3.1-35B-A3B-oQ8) (which dropped the MTP head) with the 8-bit MTP module transplanted from [`tfjack/Qwen3.6-35B-A3B-oQ8-fp16-mtp`](https://huggingface.co/tfjack/Qwen3.6-35B-A3B-oQ8-fp16-mtp), which is built on the same `Qwen3.6-35B-A3B` base that Holo-3.1 was fine-tuned from.

📦 **Code, scripts, and full write-up:** https://github.com/gaztrabisme/holo-3.1-a3b-mtp-mlx

## Why a graft?

**No Holo-3.1 release ships an MTP head**: H Company stripped the nextn layer when fine-tuning (verified across bf16, oQ8, GGUF, NVFP4). The MTP head can therefore only come from the `Qwen3.6-35B-A3B` base. Both models share an identical non-MTP tensor layout (2010 tensors, `model_type: qwen3_5_moe`, hidden 2048, 40 layers, 256 experts, vocab 248320), so the 42-tensor `language_model.mtp.*` block (already 8-bit, `group_size` matched) drops in with no transforms. The config gains `text_config.mtp_num_hidden_layers=1`, `mtp_use_dedicated_embeddings=False`, and 6 MTP quant-overrides.
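The graft described above can be sketched in a few lines. This is a minimal illustration, not the script from the linked repository: plain dicts stand in for loaded weight maps, the tensor names in the demo are hypothetical, and placing `mtp_use_dedicated_embeddings` under `text_config` is an assumption. The 6 MTP quant-overrides are not listed in this card, so the sketch leaves them out.

```python
import copy

MTP_PREFIX = "language_model.mtp."  # prefix of the MTP block named in this card


def graft_mtp(target: dict, donor: dict, prefix: str = MTP_PREFIX) -> dict:
    """Return the target's tensors plus the donor's MTP tensors.

    Both arguments map tensor name -> tensor (any value works for the sketch).
    The graft is refused unless the non-MTP tensor names match exactly.
    """
    target_keys = {k for k in target if not k.startswith(prefix)}
    donor_keys = {k for k in donor if not k.startswith(prefix)}
    if target_keys != donor_keys:
        raise ValueError("non-MTP tensor names differ; refusing to graft")

    merged = dict(target)  # keep every fine-tuned (target) tensor untouched
    merged.update({k: v for k, v in donor.items() if k.startswith(prefix)})
    return merged


def patch_config(config: dict) -> dict:
    """Return a copy of the config with the MTP fields switched on."""
    patched = copy.deepcopy(config)
    text_cfg = patched.setdefault("text_config", {})
    text_cfg["mtp_num_hidden_layers"] = 1
    # Assumption: this flag lives next to mtp_num_hidden_layers.
    text_cfg["mtp_use_dedicated_embeddings"] = False
    return patched


# Hypothetical tensor names, used only to exercise the functions.
holo = {"language_model.layers.0.weight": "holo-w0", "vision.weight": "holo-v"}
qwen = {
    "language_model.layers.0.weight": "qwen-w0",
    "vision.weight": "qwen-v",
    "language_model.mtp.fc.weight": "qwen-mtp",
}
merged = graft_mtp(holo, qwen)
config = patch_config({"text_config": {"hidden_size": 2048}})
```

Only names are compared here; a real graft would also want to check shapes and quantization parameters such as `group_size` before writing the merged weights.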
## Verified on M3 Max (96 GB), single-stream, greedy

| | base oQ8 | **this (oQ8-mtp)** |
|---|---|---|
| Decode (300 tok) | 55.0 tok/s | **60.5 tok/s (~1.1×)** |
| MTP acceptance (structured text) | n/a | **94.8%** (145/153) |
| Output vs base | n/a | **byte-identical (lossless)** |

**Lossless by construction:** speculative decoding verifies every drafted token against the main model, so output is *exactly* Holo's greedy output. This was confirmed byte-identical on text (sha1 match) and on all 12 grounding targets below. The grafted head only affects *speed*, never correctness.

### Grounding spot-check (synthetic 1280×720 UI, 12 known targets)

**12/12 hit, median 6 px error, max 55 px.** Base and MTP returned identical coordinates on all 12. (MTP acceptance on grounding is lower, 33–60%, because outputs are tiny ~15-token `{x,y}` blobs; for pure grounding, screenshot *prefill* dominates latency and MTP barely moves it. It pays off on longer agentic/navigation traces.)

Green rings = ground truth, red crosshairs = Holo's predicted click (greedy, `temperature=0`):

![grounding spot-check: ground truth vs Holo predicted clicks, 12/12 hit](grounding_spotcheck.png)

## Usage (oMLX, grounding)

Holo-3.1 grounding = single user turn, **normalized `[0,1000]` JSON `{x,y}`** (not Holo1's pixel `Click(x,y)`). Use **greedy + thinking off**; greedy is both the correct setting for stable coordinates *and* what maximizes MTP acceptance.
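The request below scales the normalized output against the dimensions of the image actually sent, so it helps to compute those dimensions up front. Here is a minimal sketch modeled on the publicly available Qwen2-VL-style `smart_resize` logic, using the factor of 32 and the pixel bounds stated in this card. Reading "16.7 M" as 2**24 is an assumption, as is the idea that oMLX's internal resize follows exactly this rule; `to_pixels` is a hypothetical helper name.

```python
import math

FACTOR = 32              # patch 16 x merge 2, as stated in this card
MIN_PIXELS = 65_536      # lower bound stated in this card
MAX_PIXELS = 16_777_216  # assumption: "16.7 M px" read as 2**24


def smart_resize(height, width, factor=FACTOR,
                 min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (height, width) snapped to multiples of `factor` within the pixel bounds."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Snap each side to the nearest multiple of `factor` (at least one patch).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def to_pixels(x_norm, y_norm, width, height):
    """Map a normalized [0,1000] point onto the image that was actually sent."""
    return round(x_norm / 1000 * width), round(y_norm / 1000 * height)


H, W = smart_resize(720, 1280)        # (704, 1280)
center = to_pixels(500, 500, W, H)    # (640, 352)
```

Under these assumptions a 1280×720 screenshot maps to 1280×704, because 720 / 32 = 22.5 and Python's `round` rounds halves to the even neighbor. If you resize the screenshot to those dimensions before encoding it, use the same values when converting the model's `{x,y}` back to pixels.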
```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="...")

W, H = 1280, 720  # send a smart_resize'd image; scale coords against THESE dims

prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    " * You must output a valid JSON following the format: "
    '{"properties":{"x":{"type":"integer"},"y":{"type":"integer"}},"required":["x","y"]}\n'
    " Your target is:\nthe 'Save Changes' button"
)

r = client.chat.completions.create(
    model="Holo-3.1-35B-A3B-oQ8-mtp",
    temperature=0,  # greedy: stable coordinates and best MTP acceptance
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": prompt},
    ]}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},  # thinking off
        # Constrain the reply to a JSON object with integer x and y.
        "response_format": {"type": "json_schema", "json_schema": {
            "name": "point",
            "schema": {
                "type": "object",
                "required": ["x", "y"],
                "additionalProperties": False,
                "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            },
        }},
    },
)

p = json.loads(r.choices[0].message.content)
# Normalized [0,1000] coordinates -> pixel click point on the image that was sent.
px, py = int(p["x"] / 1000 * W), int(p["y"] / 1000 * H)
```

> **smart_resize the screenshot first** (Qwen factor = patch 16 × merge 2 = 32; min 65 536 / max 16.7 M px) so the server's internal resize doesn't shift coordinates, and scale against the dimensions you actually send.

## Sampling: two configs

This is a two-mode model; sampling should match the mode (measured on the 12-target grounding fixture):

| Profile | Grounding accuracy | Use for |
|---|---|---|
| **greedy** (`temperature=0`, no penalties) | **12/12 hit, median 3 px** | **grounding / localization** |
| instruct (`temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5`) | 11/12, median 27 px, jitters ±40 px | agentic / navigation / general non-thinking |

- **Grounding → `temperature=0`, no penalties.** Deterministic, precise coordinates; this is also H Company's official localization setting.
  `presence_penalty` on a 2-integer `{x,y}` actively skews coordinates, so leave penalties off. MTP acceptance is lower here (~33–60%) but irrelevant: outputs are ~15 tokens and prefill dominates, so greedy costs no wall-clock time.
- **Agentic / navigation / general (non-thinking) → instruct preset:** `temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5, repetition_penalty=1.0` (Qwen3 anti-loop config). This is where MTP pays off (long traces) and where acceptance climbs **>90%**: temperature softens the rejection-sampling threshold so near-miss drafts are accepted.

`generation_config.json` ships the instruct sampling as the default (`temp 0.7 / top_p 0.8 / top_k 20 / min_p 0`); **pass `temperature=0` explicitly for grounding.** `presence_penalty` is a serving/API param (not in `generation_config.json`); add it only in agentic mode.

## Notes

- MTP is auto-detected by oMLX from `mtp_num_hidden_layers`; the log prints `MTP path activated ... accept=N/M`.
- 36 GB resident. Two 38 GB models won't co-reside under oMLX's ~70 GB prefill guard; its LRU auto-evicts the idle one.
- Use the served model for both grounding and agentic/navigation modes; MTP helps the latter more.

## Attribution

Apache-2.0. Built on the work of **H Company** (Holo-3.1), **Qwen** (Qwen3.6-35B-A3B base + MTP head), **symrex** (oQ8 quant), and **tfjack** (Qwen3.6 oQ8 MTP MLX conversion). Quantization tooling: **oMLX / oQ**.