# Speed — the MEASURED reality (M5 Max 128 GB) **Root cause (confirmed): decode is memory-bandwidth bound** — every token reloads expert weights from the ~106 GB model. The honest conclusion that follows was *measured*, not guessed (2026-06-18): > **Every single-stream speedup lever has been benchmarked, and every one is dead on this MoE.** > ~11–14 tok/s is the hard memory floor. You cannot speculate your way under it. | Lever | Hoped | **MEASURED** | Why it fails | |---|---|---|---| | MTP self-speculative | ~2.6× | **0% accept** (`89_mtp_gate.py`) | the native head is UN-pruned (256 experts) vs our pruned 77 → its MoE/router don't match → garbage drafts | | External draft model | 1.5–2.5× | **0.32× (proxy)** | the batched *verify* forward reloads ~all 77 experts; also draft+main ≈126/128 GB + Metal-unstable | | Prompt-lookup (no model) | 2–4× | **0.32×** (`src/prompt_lookup.py`) | lossless ✓ but the same MoE verify-wall — 8.4 tok/forward, but each forward loads ~all experts | | dsa-block-size / index_topk | ~1.3× | **flat 1.00–1.03×** (`90_dsa_sweep.py`) | attention is NOT the bottleneck; the MoE expert-load dominates | | Reduce active experts 8→6 | ~25% | **quality-dead** (`22_speed_tune.py`) | the router was trained for top-8 of the pruned 77; can't drop post-prune without a full re-prune+heal | ### The physics A MoE decode step is bound by loading the active experts' weights. Any *multi-token* speculative forward (MTP, draft-model, or n-gram) loads the **union** of experts for those tokens (~all 77) → costs ~K× a single step → the per-forward token gain is exactly cancelled. So **no speculative method beats single-token decode here.** Proven 4 independent ways, $0 spent (vs the $4–15K an EAGLE retrain would have cost to discover the same thing). ### What ACTUALLY delivers speed (and is already shipped) 1. **Fused MoE dequant-matmul Metal kernel** (#33) — this *is* the 11–14 tok/s; it's the real win. 2. **Throughput via batching — MEASURED 2.6× at B=8** (`scripts/91_batch_scaling.py`): total decode **15.8 → 27.1 → 34.6 → 41.1 tok/s** at B = 1 / 2 / 4 / 8. On a memory-bound MoE, *parallel sequences* are the lever — one expert-load serves the whole batch (per-seq drops, total climbs). This is where "faster" actually lives (best-of-N, proof-search, multi-agent, the flywheel), and `mlx_lm.server` batches concurrent requests **natively** (`is_batchable`, no draft model — that's us), so the win is available at the serve (#71). **Serve-verified: 1.74× at B=6** on the live `mlx_lm.server` (concurrent vs sequential, `scripts/92_serve_batch_test.py`) — already shipping; just fire concurrent requests. The ONLY measured speedup on this model that beats 1×. ### Free, orthogonal wins (prefill / TTFT — not decode tok/s) - **Prompt/prefix KV cache** (`05_serve.sh --prompt-cache-size`) — agentic loops resend the same system+tools; cache the prefix → big *time-to-first-token* cut. Real, but it doesn't move the decode tok/s number. - **M5 Neural Accelerators / current MLX** — free generation + prefill gains from keeping `mlx` current. - **`08_think_proxy.py`** — skip the thinking trace on trivial structural steps → fewer tokens, not faster/token. ### The ONLY path to a single-stream speedup (not recommended) Train a **fresh EAGLE-3 head** on the demolished model's own outputs (architecture-agnostic, ~1B dense layer): ~$4–15K cloud H100 or ~weeks of local M5 data-gen. And even then it's **uncertain** — EAGLE on a fast MoE baseline measured ~1.03× in the literature (same verify-wall). See `#69`. Bank the batching instead. ### Bottom line Single-stream is **memory-capped at ~11–14 tok/s and that's the floor**. The speed pillar is the fused kernel (#33) + throughput batching (#35/#48), both built. Don't fight physics; batch for throughput. *(Receipts: `89_mtp_gate.py` → 0%, `src/prompt_lookup.py --bench` → 0.32×, `90_dsa_sweep.py` → flat. Full log in OVERNIGHT_LOG.md under the 2026-06-18 speed entries.)*