AEON-7 Claude Opus 4.8 (1M context) committed on
Commit 14c6fe4 · verified · 1 Parent(s): 593810f

docs(perf): v0.23.0 benchmarks on aeon-vllm-ultimate:latest + charts


Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

README.md CHANGED
@@ -68,9 +68,11 @@ tags:
68
 
69
  # Gemma-4-12B-it AEON Abliterated — K=4 Multi-Direction Biprojection (BF16)
70
 
71
- > ## ✅ Verified working in native vLLM (2026-06-05)
72
  >
73
- > Smoke-tested end-to-end via vLLM's native `Gemma4UnifiedForConditionalGeneration` loader in `ghcr.io/aeon-7/aeon-vllm-ultimate:v0.22.2-pr44389-aeon-spark-gemma4unified` on DGX Spark GB10. The loader for Google's encoder-free Gemma-4-12B was added upstream in `gemma4_unified.py` (after the initial PR #38826 merge). Our previous "blocked on architecture mismatch" notes are obsolete.
 
 
74
  >
75
  > ### Single-stream by category (greedy, no spec decode, max_tokens=250)
76
  >
@@ -120,6 +122,46 @@ tags:
120
  > ```
121
 
122
 
123
  ## Available formats / quant grid
124
 
125
  | Variant | Repo | Precision | Size | Pick when |
@@ -133,9 +175,7 @@ tags:
133
 
134
  ### Recommended runtime
135
 
136
- For **Gemma-4 production today, use the previous AEON-7 image** [`ghcr.io/aeon-7/aeon-gemma-4-26b-a4b-dflash:latest`](https://github.com/AEON-7/packages?repo_name=aeon-gemma-4-26b-a4b-dflash) (vLLM 0.20.1, known-good for the multimodal `Gemma4UnifiedForConditionalGeneration` path) or load with `transformers` directly.
137
-
138
- > ⚠️ The newer **[AEON vLLM Ultimate](https://github.com/AEON-7/vllm-ultimate-dgx-spark)** container (`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`) bundles vLLM 0.22.1 + PR #44389 (NVFP4 KV cache, 3× capacity) + DFlash + TurboQuant for our **Qwen3.6 family** — but as of 2026-06-04 it has **upstream PR #44389 bugs on Gemma-4**: the multimodal fallback hits a shape mismatch in `Gemma4UnifiedForConditionalGeneration`, and the modelopt path doesn't yet recognize `NVFP4_SVD`. A patched tag will be published when upstream merges fixes. Track the container's [Known issues](https://github.com/AEON-7/vllm-ultimate-dgx-spark#known-issues).
139
 
140
  ## Capability Comparison vs Base
141
 
 
68
 
69
  # Gemma-4-12B-it AEON Abliterated — K=4 Multi-Direction Biprojection (BF16)
70
 
71
+ > ## ✅ Verified working in native vLLM (re-validated v0.23.0, 2026-06-18)
72
  >
73
+ > Smoke-tested and benchmarked end-to-end via vLLM's native `Gemma4UnifiedForConditionalGeneration` loader on the unified **`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`** image (vLLM **0.23.0**, sm_121a) on DGX Spark GB10. The loader for Google's encoder-free Gemma-4-12B was added upstream in `gemma4_unified.py` (after the initial PR #38826 merge). Our previous "blocked on architecture mismatch" notes are obsolete.
74
+ >
75
+ > **Serving note:** this BF16 repo needs a `processor_config.json` to load on vLLM's multimodal path. For the v0.23.0 benchmark we copied it from the `…-K4-FP8` sibling. **We recommend adding `processor_config.json` to this repo** (or downloading it, see below) so it serves out-of-the-box.
76
  >
77
  > ### Single-stream by category (greedy, no spec decode, max_tokens=250)
78
  >
 
122
  > ```
123
 
124
 
125
+ ## Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)
126
+
127
+ Measured on a single **DGX Spark GB10** (Blackwell sm_121a, unified LPDDR5X) with `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (vLLM **0.23.0**). This is a **dense 12B** run with **plain decode (no speculative drafter)**: MTP / speculative drafting is net-neutral on GB10 (the per-step drafter + 262k-vocab `lm_head` cost cancels the acceptance win), so the BF16 baseline is run without one. Two results matter here: the **quant-vs-speed tradeoff** (this unquantized BF16 build is the slow-but-exact anchor) and **concurrency throughput** (it scales cleanly to c=64).
128
+
129
+ <p align="center"><img src="assets/perf/gemma12b_quant_family.svg" width="100%" alt="Gemma-4-12B K4 quant family: single-stream and c=64 aggregate throughput across BF16, FP8, NVFP4, and NVFP4-FP8 mixed"></p>
130
+
131
+ **Quant-vs-speed:** unquantized BF16 is the baseline at **~7.6 tok/s** single-stream. The quantized siblings are **2.1–2.8× faster** at c=1 (FP8 and NVFP4 ≈16 tok/s, 2.1×; the NVFP4+FP8 mixed build ≈21 tok/s, 2.8×) for the same model at a fraction of the memory. Pick BF16 only when you need bit-exact weights, fine-tuning, or non-Blackwell hardware; otherwise the FP8 sibling is near-lossless and the mixed NVFP4+FP8 sibling is the smallest and fastest.
132
+
133
+ ### Single-stream by category (c=1, greedy, plain decode)
134
+
135
+ | Category | Decode tok/s | TTFT (ms) | TPOT (ms) | Prefill (tok/s) | DFlash accept |
136
+ |---|---:|---:|---:|---:|---:|
137
+ | Coding | 7.6 | 264 | 131.2 | 190 | — (plain decode) |
138
+ | Math | 7.6 | 268 | 131.4 | 247 | — |
139
+ | Reasoning | 7.6 | 264 | 131.1 | 197 | — |
140
+ | Prose | 7.6 | 264 | 131.5 | 152 | — |
141
+ | Natural language | 7.6 | 267 | 131.4 | 165 | — |
142
+ | Extraction / JSON | 7.8 | 264 | 128.0 | 216 | — |
143
+
144
+ Single-stream decode is **memory-bandwidth-bound** and flat across categories (~7.6 tok/s) — expected for a dense 24 GB BF16 model on the GB10's unified pool. There is **no DFlash/MTP acceptance** to report: this baseline runs plain decode by design (no drafter attached).
145
+
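As a quick consistency check on the table above, decode rate is the reciprocal of TPOT, and a bandwidth-bound dense model cannot decode faster than memory bandwidth divided by the bytes read per token. The sketch below is illustrative only: it assumes roughly 273 GB/s peak memory bandwidth for the GB10 (a spec-sheet figure, not something measured in this benchmark) and treats the 24 GB of BF16 weights as the only per-token traffic. The function names are made up for this example.

```python
# TPOT values (ms) copied from the single-stream table above.
TPOT_MS = {
    "Coding": 131.2,
    "Math": 131.4,
    "Reasoning": 131.1,
    "Prose": 131.5,
    "Natural language": 131.4,
    "Extraction / JSON": 128.0,
}

# ASSUMPTIONS (not from the benchmark): peak GB10 memory bandwidth, and
# weights-only memory traffic for each decoded token.
ASSUMED_BANDWIDTH_GB_S = 273.0
WEIGHT_GB = 24.0  # dense 12B parameters x 2 bytes (BF16)


def decode_tok_s(tpot_ms: float) -> float:
    """Decode rate implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode if every token re-reads all weights."""
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    for category, tpot in TPOT_MS.items():
        print(f"{category:18s} {decode_tok_s(tpot):.1f} tok/s")
    ceiling = bandwidth_ceiling_tok_s(ASSUMED_BANDWIDTH_GB_S, WEIGHT_GB)
    print(f"{'assumed ceiling':18s} {ceiling:.1f} tok/s")
```

Under those assumptions the ceiling lands a little above 11 tok/s, so a measured ~7.6 tok/s that barely moves across categories is what a bandwidth-bound decode would look like.
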
146
+ ### Aggregate throughput by concurrency
147
+
148
+ <p align="center"><img src="assets/perf/sib_g12bf16_concurrency.svg" width="100%" alt="Gemma-4-12B K4-BF16 aggregate throughput vs concurrency (c=1 to c=64) on aeon-vllm-ultimate:latest"></p>
149
+
150
+ Where this BF16 build earns its keep is **batched serving**: aggregate throughput scales near-linearly with concurrency, from ≈8 tok/s at c=1 to ~70 at c=8, ~140 at c=16, ~260 at c=32, and a peak of **≈446–458 tok/s at c=64**. The DFlash high-concurrency fix in this image (below) lets it reach c=64 without the prior crash. Long-context draft-acceptance is not applicable here (plain decode).
151
+
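One way to read "near-linear" from the figures above is to look at per-stream throughput and at the gain from each doubling of concurrency. The sketch below uses the rounded numbers quoted in the paragraph, with 452 tok/s standing in for c=64 (the midpoint of the 446–458 range, chosen here for illustration), so the ratios are indicative rather than exact.

```python
# Approximate aggregate throughput (tok/s) read from the paragraph above.
# The c=64 value is the midpoint of the quoted 446-458 range (an assumption).
AGGREGATE_TOK_S = {1: 8.0, 8: 70.0, 16: 140.0, 32: 260.0, 64: 452.0}


def per_stream(throughput: dict[int, float]) -> dict[int, float]:
    """Average tok/s each concurrent request receives."""
    return {c: total / c for c, total in throughput.items()}


def doubling_gains(throughput: dict[int, float]) -> dict[tuple[int, int], float]:
    """Throughput ratio between consecutive points where concurrency doubles."""
    levels = sorted(throughput)
    return {
        (lo, hi): throughput[hi] / throughput[lo]
        for lo, hi in zip(levels, levels[1:])
        if hi == 2 * lo
    }


if __name__ == "__main__":
    for c, rate in per_stream(AGGREGATE_TOK_S).items():
        print(f"c={c:<3d} {rate:.2f} tok/s per stream")
    for (lo, hi), gain in doubling_gains(AGGREGATE_TOK_S).items():
        print(f"c={lo} -> c={hi}: x{gain:.2f}")
```

A gain of 2.0 per doubling would be perfectly linear; with these rounded inputs the gains taper gradually toward c=64 while per-stream throughput stays in the same range as the single-stream figure.
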
152
+ > No fresh stock-vanilla baseline exists yet for this repo — a same-harness vanilla-vLLM re-bench is pending; the figures above are all on the optimized `aeon-vllm-ultimate:latest`.
153
+
154
+ ## What we fixed for the DGX Spark
155
+
156
+ All AEON models run on one unified container — **`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`** (= `:2026-06-18-v0.23.0-dflashfix`; rollback `:2026-06-11-pr41703`) — **vLLM 0.23.0 built from source for sm_121a** and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture.
157
+
158
+ - **DFlash high-concurrency fix** *(new)* — slices the speculative drafter's KV block-table to the unpadded batch (`block_table[:num_reqs]`). The drafter previously **crashed at ≥32 concurrent requests** (padded-vs-unpadded shape mismatch in FlashAttention); it now scales cleanly to **c=64**. A port of upstream PR #43982 (which fixed this for MTP but never for DFlash). This BF16 baseline runs plain decode so it isn't drafter-bound, but the fix is what lets the whole family hit c=64.
159
+ - **sm_121a-native build** — `TORCH_CUDA_ARCH_LIST=12.1a`, compiling the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to (the speed advantage the quantized siblings enjoy).
160
+ - **Triton NVFP4 KV cache** (PR #44389) — the only 4-bit KV path on sm_121a → ~3× KV capacity per GB of unified memory.
161
+ - **Unified-memory tuning** — conservative KV headroom (GB10 shares one LPDDR5X pool across CPU+GPU) with FULL CUDA graphs + async scheduling.
162
+
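The DFlash fix in the list above comes down to slicing a padded per-request table back to the live batch before attention sees it. The sketch below illustrates that idea with plain Python lists. `pad_block_table`, `slice_to_live_requests`, and the `-1` padding sentinel are hypothetical stand-ins for this example, not vLLM's data structures or the code from PR #43982.

```python
PAD_BLOCK_ID = -1  # hypothetical sentinel for unused rows and slots


def pad_block_table(block_table, padded_batch_size, max_blocks):
    """Pad rows and columns to a fixed shape, as a fixed-size batch layout might."""
    rows = [row + [PAD_BLOCK_ID] * (max_blocks - len(row)) for row in block_table]
    missing_rows = padded_batch_size - len(rows)
    rows += [[PAD_BLOCK_ID] * max_blocks for _ in range(missing_rows)]
    return rows


def slice_to_live_requests(padded_table, num_reqs):
    """Analogue of `block_table[:num_reqs]`: keep only rows for real requests."""
    return padded_table[:num_reqs]


if __name__ == "__main__":
    live = [[3, 7], [5], [9, 2, 4]]  # KV block ids for 3 real requests
    padded = pad_block_table(live, padded_batch_size=4, max_blocks=3)
    sliced = slice_to_live_requests(padded, num_reqs=len(live))
    print("padded rows:", len(padded), "rows after slicing:", len(sliced))
```

The point of the slice is that the row count handed to the attention kernel matches the number of real requests, which is the padded-vs-unpadded mismatch the bullet describes.
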
163
+ > **Stock baseline:** the optimized figures above are on `aeon-vllm-ultimate:latest` (vLLM 0.23.0). A fresh fully-vanilla stock baseline (default vLLM, no AEON / sm_121a opts) is **pending a re-bench** for this repo and will be added when complete.
164
+
165
  ## Available formats / quant grid
166
 
167
  | Variant | Repo | Precision | Size | Pick when |
 
175
 
176
  ### Recommended runtime
177
 
178
+ Use the unified **[AEON vLLM Ultimate](https://github.com/AEON-7/vllm-ultimate-dgx-spark)** container [`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`](https://github.com/AEON-7/vllm-ultimate-dgx-spark) (vLLM **0.23.0**, sm_121a) — validated and benchmarked on Gemma-4 as of 2026-06-18 (see the Performance section above). The earlier Gemma-4-specific multimodal/`NVFP4_SVD` issues are resolved on the 0.23.0 build; the only serving requirement for this BF16 repo is the `processor_config.json` noted above. `transformers` direct loading also works for non-Blackwell hardware.
 
 
179
 
180
  ## Capability Comparison vs Base
181
 
assets/perf/gemma12b_quant_family.svg ADDED
assets/perf/sib_g12bf16_concurrency.svg ADDED