---
license: other
library_name: mlx
base_model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated
tags:
  - mlx
  - mtplx
  - mtp
  - speculative-decoding
  - qwen
  - qwen3
  - qwen3.5
  - moe
  - abliterated
  - uncensored
  - dpo
  - opus
  - qwopus
  - kimi
  - kimi-k2
  - multimodal
  - vision
  - 4-bit
pipeline_tag: image-text-to-text
---

<div align="center">
  <img src="OYM_banner.png" alt="OpenYourMind" width="100%">
</div>

## Support & Community

<div align="center">

**☕ If these models are useful to you, consider supporting my work — it funds compute for more & larger abliterations.**

<a href="https://buymeacoffee.com/oym.kuato" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" height="48" width="200"></a>

[**buymeacoffee.com/oym.kuato**](https://buymeacoffee.com/oym.kuato)

💬 **Discord:** [discord.gg/rhUZY5GEZr](https://discord.gg/rhUZY5GEZr) &nbsp;&middot;&nbsp; ₿ **Bitcoin:** `bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv`

</div>

---

# Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated — MTPLX 4-bit (MoE MTP head)

## Overview

This is the MLX **4-bit** build of [`OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated`](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated) **with the Qwen3.5 MoE Multi-Token-Prediction (MTP) head included**, packaged for **[MTPLX](https://github.com/youssofal/mtplx)** native MTP speculative decoding on Apple Silicon.

The language/vision weights are byte-identical to the [`-MLX-4bit`](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MLX-4bit) build. The only additions are the MTP head (`mtp.safetensors`, **BF16**, 4.7 GB) and a `config.json` pointer (`mlx_lm_extra_tensors.mtp_file`). That sidecar is ignored by plain `mlx-lm`/`mlx-vlm`, so this folder still loads as an ordinary MLX model — but with MTPLX it also drives speculative decoding.

- **Language**: 4-bit, group size 64 (MoE routing gates kept at higher precision by the model's quant predicate), ≈ 4.5 bits/weight.
- **MTP head**: 1 layer, **MoE** (router + 256 experts / 8 active + shared expert), full self-attention, **BF16** (785 tensors). MTPLX stacks the experts into `switch_mlp` at load and verifies every drafted token against the target model.
- **Vision**: the BF16 vision tower from the base build is still present; MTPLX runs the text path only. For image input, use the `-MLX-4bit` repo with `mlx-vlm`.

## ⚠️ Requires MTPLX with Qwen3.5-MoE MTP support

The Qwen3.5 MTP head is an **MoE** block. MTPLX ≤ 0.3.7 only supported a *dense* Qwen MTP head and will reject this model with `invalid-mtp-tensor-layout`. Support is added in **[MTPLX PR #84](https://github.com/youssofal/MTPLX/pull/84)**.

Until that lands in a release, install from the branch:

```bash
pip install "git+https://github.com/janfeddersen-wq/MTPLX.git@qwen3-5-moe-mtp"
# after the PR is merged & released:  pip install -U mtplx
```

## Usage

```bash
MODEL=OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MTPLX-4bit

# one-shot, with acceptance stats
mtplx ask --model "$MODEL" --prompt "Explain Rayleigh scattering simply." --mtp --stats --yes

# interactive terminal chat
mtplx start cli --model "$MODEL" --yes

# OpenAI-compatible server
mtplx quickstart --model "$MODEL" --port 8000 --yes
```

`--yes` accepts the "family-compatible-unverified" gate (no recorded exactness baseline is shipped). Add `--no-mtp` to compare against plain autoregressive decoding.

## Measured (M5 Max, 128 GB)

120-token greedy run: depth-1 acceptance **≈ 70 %**, `accepted_by_depth = [40, 19, 3]` of `[57, 57, 56]` drafted → 120 tokens in **57 target verify passes (≈ 2.1 tokens/verify)**, ~52 decode tok/s. (Contrary to the earlier note on the base card, the MoE MTP head **does** yield a real speedup once a runtime can consume it.)

## Known limitation — MoE exactness

At temperature 0, MTP vs non-MTP greedy output is ~98 % identical and re-converges immediately, but occasionally flips a single token. This is the MoE router hitting a near-tie that resolves differently under batched verification vs single-token decode (an inherent MoE/FP effect), **not** a drafting error — the target model verifies every token. Strict bit-exactness for MoE heads is still being worked out (e.g. fp32 router logits during verify); see PR #84.

## Files

| File | Description | Size |
|------|-------------|------|
| `model-*-of-00014.safetensors` | 4-bit language weights + BF16 vision tower | ~65 GB |
| `mtp.safetensors` | **MoE MTP head (BF16)** | 4.7 GB |
| `config.json` | `Qwen3_5MoeForConditionalGeneration` + `quantization` + `mlx_lm_extra_tensors.mtp_file` | — |
| `tokenizer*`, `chat_template.jinja`, `generation_config.json`, processor configs | Standard | — |

Total on disk: **~70 GB**.

## Hardware

Needs roughly **≥ 80 GB unified memory** to load with usable context (65 GB base + ~5 GB BF16 MTP + KV cache). Runs comfortably on 96 GB+ M-series Macs.

## Notes

- **License**: Other (inherits from the Qwen3.5 base license)
- **Parent (full weights)**: [Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated)
- **Plain 4-bit MLX (no MTP, for `mlx-vlm`/LM Studio)**: [`-MLX-4bit`](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MLX-4bit)
- **Architecture**: Qwen3.5 MoE (~10B active / 122B total) + Qwen3-VL vision tower + MoE MTP head

## Disclaimer

Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.