philipjohnbasile's picture
#71 batching verified: 2.6x standalone / 1.74x live-serve
c69d41c verified
|
Raw
History Blame Contribute Delete
4.07 kB
# Speed β€” the MEASURED reality (M5 Max 128 GB)
**Root cause (confirmed): decode is memory-bandwidth bound** β€” every token reloads expert weights from the
~106 GB model. The honest conclusion that follows was *measured*, not guessed (2026-06-18):
> **Every single-stream speedup lever has been benchmarked, and every one is dead on this MoE.**
> ~11–14 tok/s is the hard memory floor. You cannot speculate your way under it.
| Lever | Hoped | **MEASURED** | Why it fails |
|---|---|---|---|
| MTP self-speculative | ~2.6Γ— | **0% accept** (`89_mtp_gate.py`) | the native head is UN-pruned (256 experts) vs our pruned 77 β†’ its MoE/router don't match β†’ garbage drafts |
| External draft model | 1.5–2.5Γ— | **0.32Γ— (proxy)** | the batched *verify* forward reloads ~all 77 experts; also draft+main β‰ˆ126/128 GB + Metal-unstable |
| Prompt-lookup (no model) | 2–4Γ— | **0.32Γ—** (`src/prompt_lookup.py`) | lossless βœ“ but the same MoE verify-wall β€” 8.4 tok/forward, but each forward loads ~all experts |
| dsa-block-size / index_topk | ~1.3Γ— | **flat 1.00–1.03Γ—** (`90_dsa_sweep.py`) | attention is NOT the bottleneck; the MoE expert-load dominates |
| Reduce active experts 8β†’6 | ~25% | **quality-dead** (`22_speed_tune.py`) | the router was trained for top-8 of the pruned 77; can't drop post-prune without a full re-prune+heal |
### The physics
A MoE decode step is bound by loading the active experts' weights. Any *multi-token* speculative forward
(MTP, draft-model, or n-gram) loads the **union** of experts for those tokens (~all 77) β†’ costs ~KΓ— a single
step β†’ the per-forward token gain is exactly cancelled. So **no speculative method beats single-token decode here.**
Proven 4 independent ways, $0 spent (vs the $4–15K an EAGLE retrain would have cost to discover the same thing).
### What ACTUALLY delivers speed (and is already shipped)
1. **Fused MoE dequant-matmul Metal kernel** (#33) β€” this *is* the 11–14 tok/s; it's the real win.
2. **Throughput via batching β€” MEASURED 2.6Γ— at B=8** (`scripts/91_batch_scaling.py`): total decode
**15.8 β†’ 27.1 β†’ 34.6 β†’ 41.1 tok/s** at B = 1 / 2 / 4 / 8. On a memory-bound MoE, *parallel sequences* are
the lever β€” one expert-load serves the whole batch (per-seq drops, total climbs). This is where "faster"
actually lives (best-of-N, proof-search, multi-agent, the flywheel), and `mlx_lm.server` batches concurrent
requests **natively** (`is_batchable`, no draft model β€” that's us), so the win is available at the serve (#71).
**Serve-verified: 1.74Γ— at B=6** on the live `mlx_lm.server` (concurrent vs sequential, `scripts/92_serve_batch_test.py`)
β€” already shipping; just fire concurrent requests. The ONLY measured speedup on this model that beats 1Γ—.
### Free, orthogonal wins (prefill / TTFT β€” not decode tok/s)
- **Prompt/prefix KV cache** (`05_serve.sh --prompt-cache-size`) β€” agentic loops resend the same system+tools;
cache the prefix β†’ big *time-to-first-token* cut. Real, but it doesn't move the decode tok/s number.
- **M5 Neural Accelerators / current MLX** β€” free generation + prefill gains from keeping `mlx` current.
- **`08_think_proxy.py`** β€” skip the thinking trace on trivial structural steps β†’ fewer tokens, not faster/token.
### The ONLY path to a single-stream speedup (not recommended)
Train a **fresh EAGLE-3 head** on the demolished model's own outputs (architecture-agnostic, ~1B dense layer):
~$4–15K cloud H100 or ~weeks of local M5 data-gen. And even then it's **uncertain** β€” EAGLE on a fast MoE
baseline measured ~1.03Γ— in the literature (same verify-wall). See `#69`. Bank the batching instead.
### Bottom line
Single-stream is **memory-capped at ~11–14 tok/s and that's the floor**. The speed pillar is the fused kernel
(#33) + throughput batching (#35/#48), both built. Don't fight physics; batch for throughput.
*(Receipts: `89_mtp_gate.py` β†’ 0%, `src/prompt_lookup.py --bench` β†’ 0.32Γ—, `90_dsa_sweep.py` β†’ flat. Full log
in OVERNIGHT_LOG.md under the 2026-06-18 speed entries.)*