philipjohnbasile's picture
#71 batching verified: 2.6x standalone / 1.74x live-serve
c69d41c verified
|
Raw
History Blame Contribute Delete
4.07 kB

Speed β€” the MEASURED reality (M5 Max 128 GB)

Root cause (confirmed): decode is memory-bandwidth bound β€” every token reloads expert weights from the ~106 GB model. The honest conclusion that follows was measured, not guessed (2026-06-18):

Every single-stream speedup lever has been benchmarked, and every one is dead on this MoE. ~11–14 tok/s is the hard memory floor. You cannot speculate your way under it.

Lever Hoped MEASURED Why it fails
MTP self-speculative ~2.6Γ— 0% accept (89_mtp_gate.py) the native head is UN-pruned (256 experts) vs our pruned 77 β†’ its MoE/router don't match β†’ garbage drafts
External draft model 1.5–2.5Γ— 0.32Γ— (proxy) the batched verify forward reloads ~all 77 experts; also draft+main β‰ˆ126/128 GB + Metal-unstable
Prompt-lookup (no model) 2–4Γ— 0.32Γ— (src/prompt_lookup.py) lossless βœ“ but the same MoE verify-wall β€” 8.4 tok/forward, but each forward loads ~all experts
dsa-block-size / index_topk ~1.3Γ— flat 1.00–1.03Γ— (90_dsa_sweep.py) attention is NOT the bottleneck; the MoE expert-load dominates
Reduce active experts 8β†’6 ~25% quality-dead (22_speed_tune.py) the router was trained for top-8 of the pruned 77; can't drop post-prune without a full re-prune+heal

The physics

A MoE decode step is bound by loading the active experts' weights. Any multi-token speculative forward (MTP, draft-model, or n-gram) loads the union of experts for those tokens (~all 77) β†’ costs ~KΓ— a single step β†’ the per-forward token gain is exactly cancelled. So no speculative method beats single-token decode here. Proven 4 independent ways, $0 spent (vs the $4–15K an EAGLE retrain would have cost to discover the same thing).

What ACTUALLY delivers speed (and is already shipped)

  1. Fused MoE dequant-matmul Metal kernel (#33) β€” this is the 11–14 tok/s; it's the real win.
  2. Throughput via batching β€” MEASURED 2.6Γ— at B=8 (scripts/91_batch_scaling.py): total decode 15.8 β†’ 27.1 β†’ 34.6 β†’ 41.1 tok/s at B = 1 / 2 / 4 / 8. On a memory-bound MoE, parallel sequences are the lever β€” one expert-load serves the whole batch (per-seq drops, total climbs). This is where "faster" actually lives (best-of-N, proof-search, multi-agent, the flywheel), and mlx_lm.server batches concurrent requests natively (is_batchable, no draft model β€” that's us), so the win is available at the serve (#71). Serve-verified: 1.74Γ— at B=6 on the live mlx_lm.server (concurrent vs sequential, scripts/92_serve_batch_test.py) β€” already shipping; just fire concurrent requests. The ONLY measured speedup on this model that beats 1Γ—.

Free, orthogonal wins (prefill / TTFT β€” not decode tok/s)

  • Prompt/prefix KV cache (05_serve.sh --prompt-cache-size) β€” agentic loops resend the same system+tools; cache the prefix β†’ big time-to-first-token cut. Real, but it doesn't move the decode tok/s number.
  • M5 Neural Accelerators / current MLX β€” free generation + prefill gains from keeping mlx current.
  • 08_think_proxy.py β€” skip the thinking trace on trivial structural steps β†’ fewer tokens, not faster/token.

The ONLY path to a single-stream speedup (not recommended)

Train a fresh EAGLE-3 head on the demolished model's own outputs (architecture-agnostic, ~1B dense layer): ~$4–15K cloud H100 or ~weeks of local M5 data-gen. And even then it's uncertain β€” EAGLE on a fast MoE baseline measured ~1.03Γ— in the literature (same verify-wall). See #69. Bank the batching instead.

Bottom line

Single-stream is memory-capped at ~11–14 tok/s and that's the floor. The speed pillar is the fused kernel (#33) + throughput batching (#35/#48), both built. Don't fight physics; batch for throughput.

(Receipts: 89_mtp_gate.py β†’ 0%, src/prompt_lookup.py --bench β†’ 0.32Γ—, 90_dsa_sweep.py β†’ flat. Full log in OVERNIGHT_LOG.md under the 2026-06-18 speed entries.)