philipjohnbasile commited on
Commit
c69d41c
Β·
verified Β·
1 Parent(s): ef322ad

#71 batching verified: 2.6x standalone / 1.74x live-serve

Browse files
Files changed (1) hide show
  1. SPEED.md +7 -3
SPEED.md CHANGED
@@ -22,9 +22,13 @@ Proven 4 independent ways, $0 spent (vs the $4–15K an EAGLE retrain would have
22
 
23
  ### What ACTUALLY delivers speed (and is already shipped)
24
  1. **Fused MoE dequant-matmul Metal kernel** (#33) β€” this *is* the 11–14 tok/s; it's the real win.
25
- 2. **Throughput via batching** β€” batched best-of-N (#35) + continuous batching (#48). On a memory-bound MoE,
26
- *parallel sequences* are the lever: one expert-load serves the whole batch, so tok/s **scales with batch**.
27
- This is where "faster" actually lives for this model β€” not single-stream.
 
 
 
 
28
 
29
  ### Free, orthogonal wins (prefill / TTFT β€” not decode tok/s)
30
  - **Prompt/prefix KV cache** (`05_serve.sh --prompt-cache-size`) β€” agentic loops resend the same system+tools;
 
22
 
23
  ### What ACTUALLY delivers speed (and is already shipped)
24
  1. **Fused MoE dequant-matmul Metal kernel** (#33) β€” this *is* the 11–14 tok/s; it's the real win.
25
+ 2. **Throughput via batching β€” MEASURED 2.6Γ— at B=8** (`scripts/91_batch_scaling.py`): total decode
26
+ **15.8 β†’ 27.1 β†’ 34.6 β†’ 41.1 tok/s** at B = 1 / 2 / 4 / 8. On a memory-bound MoE, *parallel sequences* are
27
+ the lever β€” one expert-load serves the whole batch (per-seq drops, total climbs). This is where "faster"
28
+ actually lives (best-of-N, proof-search, multi-agent, the flywheel), and `mlx_lm.server` batches concurrent
29
+ requests **natively** (`is_batchable`, no draft model β€” that's us), so the win is available at the serve (#71).
30
+ **Serve-verified: 1.74Γ— at B=6** on the live `mlx_lm.server` (concurrent vs sequential, `scripts/92_serve_batch_test.py`)
31
+ β€” already shipping; just fire concurrent requests. The ONLY measured speedup on this model that beats 1Γ—.
32
 
33
  ### Free, orthogonal wins (prefill / TTFT β€” not decode tok/s)
34
  - **Prompt/prefix KV cache** (`05_serve.sh --prompt-cache-size`) β€” agentic loops resend the same system+tools;