---
license: other
library_name: mlx
base_model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated
tags:
- mlx
- mtplx
- mtp
- speculative-decoding
- qwen
- qwen3
- qwen3.5
- moe
- abliterated
- uncensored
- dpo
- opus
- qwopus
- kimi
- kimi-k2
- multimodal
- vision
- 4-bit
pipeline_tag: image-text-to-text
---
## Support & Community
**☕ If these models are useful to you, consider supporting my work — it funds compute for more & larger abliterations.**

[**buymeacoffee.com/oym.kuato**](https://buymeacoffee.com/oym.kuato)
💬 **Discord:** [discord.gg/rhUZY5GEZr](https://discord.gg/rhUZY5GEZr) · ₿ **Bitcoin:** `bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv`
---
# Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated — MTPLX 4-bit (MoE MTP head)
## Overview
This is the MLX **4-bit** build of [`OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated`](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated) **with the Qwen3.5 MoE Multi-Token-Prediction (MTP) head included**, packaged for **[MTPLX](https://github.com/youssofal/mtplx)** native MTP speculative decoding on Apple Silicon.
The language/vision weights are byte-identical to the [`-MLX-4bit`](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MLX-4bit) build. The only additions are the MTP head (`mtp.safetensors`, **BF16**, 4.7 GB) and a `config.json` pointer (`mlx_lm_extra_tensors.mtp_file`). That sidecar is ignored by plain `mlx-lm`/`mlx-vlm`, so this folder still loads as an ordinary MLX model — but with MTPLX it also drives speculative decoding.
- **Language**: 4-bit, group size 64 (MoE routing gates kept at higher precision by the model's quant predicate), ≈ 4.5 bits/weight.
- **MTP head**: 1 layer, **MoE** (router + 256 experts / 8 active + shared expert), full self-attention, **BF16** (785 tensors). MTPLX stacks the experts into `switch_mlp` at load and verifies every drafted token against the target model.
- **Vision**: the BF16 vision tower from the base build is still present; MTPLX runs the text path only. For image input, use the `-MLX-4bit` repo with `mlx-vlm`.
## ⚠️ Requires MTPLX with Qwen3.5-MoE MTP support
The Qwen3.5 MTP head is an **MoE** block. MTPLX ≤ 0.3.7 only supported a *dense* Qwen MTP head and will reject this model with `invalid-mtp-tensor-layout`. Support is added in **[MTPLX PR #84](https://github.com/youssofal/MTPLX/pull/84)**.
Until that lands in a release, install from the branch:
```bash
pip install "git+https://github.com/janfeddersen-wq/MTPLX.git@qwen3-5-moe-mtp"
# after the PR is merged & released: pip install -U mtplx
```
## Usage
```bash
MODEL=OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MTPLX-4bit
# one-shot, with acceptance stats
mtplx ask --model "$MODEL" --prompt "Explain Rayleigh scattering simply." --mtp --stats --yes
# interactive terminal chat
mtplx start cli --model "$MODEL" --yes
# OpenAI-compatible server
mtplx quickstart --model "$MODEL" --port 8000 --yes
```
`--yes` accepts the "family-compatible-unverified" gate (no recorded exactness baseline is shipped). Add `--no-mtp` to compare against plain autoregressive decoding.
## Measured (M5 Max, 128 GB)
120-token greedy run: depth-1 acceptance **≈ 70 %**, `accepted_by_depth = [40, 19, 3]` of `[57, 57, 56]` drafted → 120 tokens in **57 target verify passes (≈ 2.1 tokens/verify)**, ~52 decode tok/s. (Contrary to the earlier note on the base card, the MoE MTP head **does** yield a real speedup once a runtime can consume it.)
## Known limitation — MoE exactness
At temperature 0, MTP vs non-MTP greedy output is ~98 % identical and re-converges immediately, but occasionally flips a single token. This is the MoE router hitting a near-tie that resolves differently under batched verification vs single-token decode (an inherent MoE/FP effect), **not** a drafting error — the target model verifies every token. Strict bit-exactness for MoE heads is still being worked out (e.g. fp32 router logits during verify); see PR #84.
## Files
| File | Description | Size |
|------|-------------|------|
| `model-*-of-00014.safetensors` | 4-bit language weights + BF16 vision tower | ~65 GB |
| `mtp.safetensors` | **MoE MTP head (BF16)** | 4.7 GB |
| `config.json` | `Qwen3_5MoeForConditionalGeneration` + `quantization` + `mlx_lm_extra_tensors.mtp_file` | — |
| `tokenizer*`, `chat_template.jinja`, `generation_config.json`, processor configs | Standard | — |
Total on disk: **~70 GB**.
## Hardware
Needs roughly **≥ 80 GB unified memory** to load with usable context (65 GB base + ~5 GB BF16 MTP + KV cache). Runs comfortably on 96 GB+ M-series Macs.
## Notes
- **License**: Other (inherits from the Qwen3.5 base license)
- **Parent (full weights)**: [Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated)
- **Plain 4-bit MLX (no MTP, for `mlx-vlm`/LM Studio)**: [`-MLX-4bit`](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MLX-4bit)
- **Architecture**: Qwen3.5 MoE (~10B active / 122B total) + Qwen3-VL vision tower + MoE MTP head
## Disclaimer
Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.