---
license: apache-2.0
library_name: mlx
pipeline_tag: image-text-to-text
base_model:
- Hcompany/Holo-3.1-35B-A3B
- symrex/Holo-3.1-35B-A3B-oQ8
- Qwen/Qwen3.6-35B-A3B
tags:
- mlx
- omlx
- oq
- quantized
- mtp
- speculative-decoding
- grounding
- computer-use
- gui
- qwen3_5_moe
---

# Holo-3.1-35B-A3B-oQ8-mtp

8-bit (oMLX `oQ8`) MLX build of **[Hcompany/Holo-3.1-35B-A3B](https://huggingface.co/Hcompany/Holo-3.1-35B-A3B)** — H Company's GUI-grounding / computer-use VLM — **with a Multi-Token-Prediction (MTP / nextn) head grafted in** so [oMLX](https://github.com/jundot/omlx) can run native speculative decoding.

It is [`symrex/Holo-3.1-35B-A3B-oQ8`](https://huggingface.co/symrex/Holo-3.1-35B-A3B-oQ8) (which dropped the MTP head) with the 8-bit MTP module transplanted from [`tfjack/Qwen3.6-35B-A3B-oQ8-fp16-mtp`](https://huggingface.co/tfjack/Qwen3.6-35B-A3B-oQ8-fp16-mtp) — the same `Qwen3.6-35B-A3B` base Holo-3.1 was fine-tuned from.

📦 **Code, scripts, and full write-up:** https://github.com/gaztrabisme/holo-3.1-a3b-mtp-mlx

## Why a graft?

**No Holo-3.1 release ships an MTP head** — H Company stripped the nextn layer when fine-tuning (verified across bf16, oQ8, GGUF, NVFP4). The MTP head therefore can only come from the `Qwen3.6-35B-A3B` base. Both models share a **byte-identical non-MTP tensor set** (2010 tensors, `model_type: qwen3_5_moe`, hidden 2048, 40 layers, 256 experts, vocab 248320), so the 42-tensor `language_model.mtp.*` block (already 8-bit, `group_size` matched) drops in with no transforms. Config gains `text_config.mtp_num_hidden_layers=1`, `mtp_use_dedicated_embeddings=False`, and 6 MTP quant-overrides.
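
The graft described above amounts to a key-prefix copy plus a config patch. The following is a minimal sketch, not the exact script used for this build. It assumes the weights are already loaded into plain dictionaries keyed by tensor name (reading and writing the safetensors shards is not shown) and that both config flags live under `text_config`. `graft_mtp` and `patch_config` are hypothetical helper names, and the 6 MTP quant-overrides are left out.

```python
MTP_PREFIX = "language_model.mtp."


def graft_mtp(target, donor, prefix=MTP_PREFIX):
    """Return a copy of `target` with the donor's MTP tensors added."""
    mtp = {k: v for k, v in donor.items() if k.startswith(prefix)}
    if not mtp:
        raise ValueError("donor has no tensors under " + prefix)
    clash = sorted(set(mtp) & set(target))
    if clash:
        raise ValueError(f"target already has MTP tensors: {clash[:3]}")
    # Outside the MTP block, both checkpoints must expose the same tensor names.
    if set(donor) - set(mtp) != set(target):
        raise ValueError("non-MTP tensor names differ between target and donor")
    merged = dict(target)
    merged.update(mtp)  # target weights are kept; only MTP tensors come from the donor
    return merged


def patch_config(config):
    """Return a copy of `config` with the MTP flags set (assumed to sit in text_config)."""
    cfg = dict(config)
    text_cfg = dict(cfg.get("text_config", {}))
    text_cfg["mtp_num_hidden_layers"] = 1
    text_cfg["mtp_use_dedicated_embeddings"] = False
    cfg["text_config"] = text_cfg
    return cfg
```

On the real checkpoints, the merged dictionary would be expected to hold 2010 + 42 = 2052 tensors if the counts above apply.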

## Verified on M3 Max (96 GB), single-stream, greedy

| | base oQ8 | **this (oQ8-mtp)** |
|---|---|---|
| Decode (300 tok) | 55.0 tok/s | **60.5 tok/s (~1.1×)** |
| MTP acceptance (structured text) | — | **94.8%** (145/153) |
| Output vs base | — | **byte-identical (lossless)** |

**Lossless by construction:** speculative decoding verifies every drafted token against the main model, so output is *exactly* Holo's greedy output — confirmed byte-identical on text (sha1 match) and on all 12 grounding targets below. The grafted head only affects *speed*, never correctness.
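
To illustrate why the draft head cannot change greedy output, here is a toy sketch of greedy speculative verification. It is not oMLX's implementation: `target_next` and `draft_next` are hypothetical callables that map a token list to the next token, verification is done one position at a time (a real engine checks all drafted positions in a single batched forward pass), and the bonus token usually emitted after a fully accepted draft is omitted.

```python
def greedy_decode(target_next, prompt, n_new):
    """Plain greedy decoding with the main model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_greedy(target_next, draft_next, prompt, n_new, k=1):
    """Greedy decoding with a draft head; returns (tokens, accepted, drafted)."""
    seq = list(prompt)
    end = len(prompt) + n_new
    accepted = drafted = 0
    while len(seq) < end:
        # Draft up to k tokens with the cheap head.
        draft, ctx = [], list(seq)
        for _ in range(min(k, end - len(seq))):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        drafted += len(draft)
        # Verify: the emitted token is always the main model's own choice.
        for tok in draft:
            want = target_next(seq)
            seq.append(want)
            if tok != want:
                break  # discard the rest of the draft after the first mismatch
            accepted += 1
    return seq, accepted, drafted
```

Whatever `draft_next` proposes, the returned tokens equal `greedy_decode(target_next, prompt, n_new)`; only the `accepted / drafted` ratio, and therefore the speed, depends on the draft quality.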

### Grounding spot-check (synthetic 1280×720 UI, 12 known targets)

**12/12 hit, median 6 px error, max 55 px.** Base and MTP returned identical coordinates on all 12. (MTP acceptance on grounding is lower, 33–60%, because outputs are tiny ~15-token `{x,y}` blobs; for pure grounding, screenshot *prefill* dominates latency and MTP barely moves it — it pays off on longer agentic/navigation traces.)

Green rings = ground truth, red crosshairs = Holo's predicted click (greedy, `temperature=0`):

![grounding spot-check — ground truth vs Holo predicted clicks, 12/12 hit](grounding_spotcheck.png)

## Usage (oMLX, grounding)

Holo-3.1 grounding takes a single user turn and returns **normalized `[0,1000]` JSON `{x,y}`** coordinates (not Holo1's pixel `Click(x,y)`). Use **greedy + thinking off**; greedy is both the correct setting for stable coordinates *and* what maximizes MTP acceptance.

```python
import json

from openai import OpenAI

# oMLX serves an OpenAI-compatible API on the local machine.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="...")

W, H = 1280, 720  # send a smart_resize'd image; scale coords against THESE dims
prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    " * You must output a valid JSON following the format: "
    '{"properties":{"x":{"type":"integer"},"y":{"type":"integer"}},"required":["x","y"]}\n'
    " Your target is:\nthe 'Save Changes' button")

r = client.chat.completions.create(
    model="Holo-3.1-35B-A3B-oQ8-mtp", temperature=0,  # greedy for grounding
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": prompt}]}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},  # thinking off
        "response_format": {"type": "json_schema", "json_schema": {"name": "point",
            "schema": {"type": "object", "required": ["x", "y"], "additionalProperties": False,
                       "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}}}}},
    })

# x and y come back normalized to [0, 1000]; convert to pixels of the sent image.
p = json.loads(r.choices[0].message.content)
px, py = int(p["x"] / 1000 * W), int(p["y"] / 1000 * H)  # -> pixel click point
```

> **smart_resize the screenshot first** (Qwen factor = patch 16 × merge 2 = 32; min 65 536 / max 16.7 M px) so the server's internal resize doesn't shift coordinates, and scale against the dimensions you actually send.
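
For reference, a sketch of the resize computation, modeled on the `smart_resize` helper from Qwen's vision preprocessing and using the factor and pixel bounds from the note above. It assumes "16.7 M" means 16,777,216 (2^24) and is not guaranteed to match the server's internal code line for line.

```python
import math

FACTOR = 32              # patch 16 x merge 2
MIN_PIXELS = 65_536
MAX_PIXELS = 16_777_216  # assumption: "16.7 M" read as 2**24


def smart_resize(height, width, factor=FACTOR,
                 min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (height, width) snapped to multiples of `factor` within the pixel bounds."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w
```

Resize the screenshot to the returned size before encoding it (for example with Pillow's `img.resize((w, h))`), then use that `w` and `h` as `W` and `H` when converting the model's normalized output to pixels.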

## Sampling — two configs

This is a two-mode model; sampling should match the mode (measured on the 12-target grounding fixture):

| Profile | Grounding accuracy | Use for |
|---|---|---|
| **greedy** (`temperature=0`, no penalties) | **12/12 hit, median 3 px** | **grounding / localization** |
| instruct (`temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5`) | 11/12, median 27 px, jitters ±40 px | agentic / navigation / general non-thinking |

- **Grounding → `temperature=0`, no penalties.** Deterministic, precise coordinates; this is also H Company's official localization setting. `presence_penalty` on a two-integer `{x,y}` output actively skews coordinates, so leave penalties off. MTP acceptance is lower here (~33–60%), but that is irrelevant: outputs are ~15 tokens and prefill dominates, so greedy costs no wall-clock time.
- **Agentic / navigation / general (non-thinking) → instruct preset:** `temperature=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty=1.5, repetition_penalty=1.0` (Qwen3 anti-loop config). This is where MTP pays off (long traces) and where acceptance climbs **>90%** — temperature softens the rejection-sampling threshold so near-miss drafts are accepted.

`generation_config.json` ships the instruct sampling as the default (`temp 0.7 / top_p 0.8 / top_k 20 / min_p 0`); **pass `temperature=0` explicitly for grounding.** `presence_penalty` is a serving/API param (not in `generation_config.json`) — add it only in agentic mode.
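
One way to keep the two profiles from leaking into each other is to build the request arguments from a mode name. The sketch below is illustrative: `PROFILES` and `request_kwargs` are hypothetical names, and it assumes the server accepts `top_k`, `min_p`, and `repetition_penalty` as extra request-body fields, since they are not named arguments of the OpenAI Python client.

```python
import copy

PROFILES = {
    # Greedy, no penalties: grounding / localization.
    "grounding": {"temperature": 0},
    # Instruct preset: agentic / navigation / general non-thinking.
    "agentic": {
        "temperature": 0.7,
        "top_p": 0.8,
        "presence_penalty": 1.5,
        # Assumption: non-OpenAI sampling fields are passed through extra_body.
        "extra_body": {"top_k": 20, "min_p": 0, "repetition_penalty": 1.0},
    },
}


def request_kwargs(mode, enable_thinking=False):
    """Return keyword arguments for chat.completions.create for the given mode."""
    if mode not in PROFILES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(PROFILES)}")
    kwargs = copy.deepcopy(PROFILES[mode])  # never mutate the shared presets
    extra = kwargs.setdefault("extra_body", {})
    extra["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return kwargs
```

A call would then look like `client.chat.completions.create(model=..., messages=..., **request_kwargs("grounding"))`, with any `response_format` added to the returned `extra_body` as in the usage example.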

## Notes

- MTP is auto-detected by oMLX from `mtp_num_hidden_layers`; the log prints `MTP path activated ... accept=N/M`.
- 36 GB resident. Two 38 GB models won't co-reside under oMLX's ~70 GB prefill guard — its LRU auto-evicts the idle one.
- Use the served model for both grounding and agentic/navigation modes; MTP helps the latter more.

## Attribution

Apache-2.0. Built on the work of **H Company** (Holo-3.1), **Qwen** (Qwen3.6-35B-A3B base + MTP head), **symrex** (oQ8 quant), and **tfjack** (Qwen3.6 oQ8 MTP MLX conversion). Quantization tooling: **oMLX / oQ**.