Instructions for using AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with libraries, inference servers, and local apps.

Libraries

How to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16")
model = AutoModelForMultimodalLM.from_pretrained("AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
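
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's answer. For the SGLang server described later, pass `base_url="http://localhost:30000"`.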

Use Docker

docker model run hf.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16

SGLang

How to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with Docker Model Runner:
```
docker model run hf.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16
```

AEON-7 and Claude Opus 4.8 (1M context) committed 1 day ago

Commit 14c6fe4 (verified) · 1 parent: 593810f

docs(perf): v0.23.0 benchmarks on aeon-vllm-ultimate:latest + charts

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (3)

README.md +45 -5
assets/perf/gemma12b_quant_family.svg +39 -0
assets/perf/sib_g12bf16_concurrency.svg +63 -0

README.md CHANGED

@@ -68,9 +68,11 @@ tags:
 # Gemma-4-12B-it AEON Abliterated — K=4 Multi-Direction Biprojection (BF16)
-> ## ✅ Verified working in native vLLM (2026-06-05)
 >
-> Smoke-tested end-to-end via vLLM's native `Gemma4UnifiedForConditionalGeneration` loader in `ghcr.io/aeon-7/aeon-vllm-ultimate:v0.22.2-pr44389-aeon-spark-gemma4unified` on DGX Spark GB10. The loader for Google's encoder-free Gemma-4-12B was added upstream in `gemma4_unified.py` (after the initial PR #38826 merge). Our previous "blocked on architecture mismatch" notes are obsolete.
 >
 > ### Single-stream by category (greedy, no spec decode, max_tokens=250)
 >
@@ -120,6 +122,46 @@ tags:
 > ```
 ## Available formats / quant grid
 | Variant | Repo | Precision | Size | Pick when |
@@ -133,9 +175,7 @@ tags:
 ### Recommended runtime
-For **Gemma-4 production today, use the previous AEON-7 image** [`ghcr.io/aeon-7/aeon-gemma-4-26b-a4b-dflash:latest`](https://github.com/AEON-7/packages?repo_name=aeon-gemma-4-26b-a4b-dflash) (vLLM 0.20.1, known-good for the multimodal `Gemma4UnifiedForConditionalGeneration` path) or load with `transformers` directly.
-> ⚠️ The newer **[AEON vLLM Ultimate](https://github.com/AEON-7/vllm-ultimate-dgx-spark)** container (`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`) bundles vLLM 0.22.1 + PR #44389 (NVFP4 KV cache, 3× capacity) + DFlash + TurboQuant for our **Qwen3.6 family** — but as of 2026-06-04 it has **upstream PR #44389 bugs on Gemma-4**: the multimodal fallback hits a shape mismatch in `Gemma4UnifiedForConditionalGeneration`, and the modelopt path doesn't yet recognize `NVFP4_SVD`. A patched tag will be published when upstream merges fixes. Track the container's [Known issues](https://github.com/AEON-7/vllm-ultimate-dgx-spark#known-issues).
 ## Capability Comparison vs Base

 # Gemma-4-12B-it AEON Abliterated — K=4 Multi-Direction Biprojection (BF16)
+> ## ✅ Verified working in native vLLM (re-validated v0.23.0, 2026-06-18)
 >
+> Smoke-tested and benchmarked end-to-end via vLLM's native `Gemma4UnifiedForConditionalGeneration` loader on the unified **`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`** image (vLLM **0.23.0**, sm_121a) on DGX Spark GB10. The loader for Google's encoder-free Gemma-4-12B was added upstream in `gemma4_unified.py` (after the initial PR #38826 merge). Our previous "blocked on architecture mismatch" notes are obsolete.
+>
+> **Serving note:** this BF16 repo needs a `processor_config.json` to load on vLLM's multimodal path. For the v0.23.0 benchmark we copied it from the `…-K4-FP8` sibling. **We recommend adding `processor_config.json` to this repo** (or downloading it, see below) so it serves out-of-the-box.
 >
 > ### Single-stream by category (greedy, no spec decode, max_tokens=250)
 >
 > ```
+## Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)
+Measured on a single **DGX Spark GB10** (Blackwell sm_121a, unified LPDDR5X) with `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (vLLM **0.23.0**). This is a **dense 12B** run with **plain decode (no speculative drafter)** — MTP / speculative drafting is net-neutral on GB10 (the per-step drafter + 262k-vocab `lm_head` cost cancels the acceptance win), so the BF16 baseline is run straight. The story here is the **quant-vs-speed tradeoff** (this unquantized BF16 is the slow-but-exact anchor) and **concurrency throughput** (it scales cleanly to c=64).
+<p align="center"><img src="assets/perf/gemma12b_quant_family.svg" width="100%" alt="Gemma-4-12B K4 quant family: single-stream and c=64 aggregate throughput across BF16, FP8, NVFP4, and NVFP4-FP8 mixed"></p>
+**Quant-vs-speed:** unquantized BF16 is the baseline at **~7.6 tok/s** single-stream. The quantized siblings are **2.1–2.8× faster** at c=1 — FP8 and NVFP4 ≈16 tok/s (2.1×), the NVFP4+FP8 mixed ≈21 tok/s (2.8×) — for the same weights at a fraction of the memory. Pick BF16 only when you need bit-exact weights, fine-tuning, or non-Blackwell hardware; otherwise the FP8 sibling is near-lossless and the mixed NVFP4+FP8 is smallest + fastest.
+### Single-stream by category (c=1, greedy, plain decode)
+| Category | Decode tok/s | TTFT (ms) | TPOT (ms) | Prefill (tok/s) | DFlash accept |
+|---|---:|---:|---:|---:|---:|
+| Coding | 7.6 | 264 | 131.2 | 190 | — (plain decode) |
+| Math | 7.6 | 268 | 131.4 | 247 | — |
+| Reasoning | 7.6 | 264 | 131.1 | 197 | — |
+| Prose | 7.6 | 264 | 131.5 | 152 | — |
+| Natural language | 7.6 | 267 | 131.4 | 165 | — |
+| Extraction / JSON | 7.8 | 264 | 128.0 | 216 | — |
+Single-stream decode is **memory-bandwidth-bound** and flat across categories (~7.6 tok/s) — expected for a dense 24 GB BF16 model on the GB10's unified pool. There is **no DFlash/MTP acceptance** to report: this baseline runs plain decode by design (no drafter attached).
+### Aggregate throughput by concurrency
+<p align="center"><img src="assets/perf/sib_g12bf16_concurrency.svg" width="100%" alt="Gemma-4-12B K4-BF16 aggregate throughput vs concurrency (c=1 to c=64) on aeon-vllm-ultimate:latest"></p>
+Where this BF16 build earns its keep is **batched serving**: aggregate throughput scales near-linearly with concurrency, peaking at **c=64 ≈ 446–458 tok/s** (≈8 tok/s at c=1 → ~70 at c=8 → ~140 at c=16 → ~260 at c=32 → **~450–460 at c=64**). The DFlash high-concurrency fix in this image (below) lets it reach c=64 without the prior crash. Long-context draft-acceptance is not applicable here (plain decode).
+> No fresh stock-vanilla baseline exists yet for this repo — a same-harness vanilla-vLLM re-bench is pending; the figures above are all on the optimized `aeon-vllm-ultimate:latest`.
+## What we fixed for the DGX Spark
+All AEON models run on one unified container — **`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`** (= `:2026-06-18-v0.23.0-dflashfix`; rollback `:2026-06-11-pr41703`) — **vLLM 0.23.0 built from source for sm_121a** and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture.
+- **DFlash high-concurrency fix** *(new)* — slices the speculative drafter's KV block-table to the unpadded batch (`block_table[:num_reqs]`). The drafter previously **crashed at ≥32 concurrent requests** (padded-vs-unpadded shape mismatch in FlashAttention); it now scales cleanly to **c=64**. A port of upstream PR #43982 (which fixed this for MTP but never for DFlash). This BF16 baseline runs plain decode so it isn't drafter-bound, but the fix is what lets the whole family hit c=64.
+- **sm_121a-native build** — `TORCH_CUDA_ARCH_LIST=12.1a`, compiling the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to (the speed advantage the quantized siblings enjoy).
+- **Triton NVFP4 KV cache** (PR #44389) — the only 4-bit KV path on sm_121a → ~3× KV capacity per GB of unified memory.
+- **Unified-memory tuning** — conservative KV headroom (GB10 shares one LPDDR5X pool across CPU+GPU) with FULL CUDA graphs + async scheduling.
+> **Stock baseline:** the optimized figures above are on `aeon-vllm-ultimate:latest` (vLLM 0.23.0). A fresh fully-vanilla stock baseline (default vLLM, no AEON / sm_121a opts) is **pending a re-bench** for this repo and will be added when complete.
 ## Available formats / quant grid
 | Variant | Repo | Precision | Size | Pick when |
 ### Recommended runtime
+Use the unified **[AEON vLLM Ultimate](https://github.com/AEON-7/vllm-ultimate-dgx-spark)** container [`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`](https://github.com/AEON-7/vllm-ultimate-dgx-spark) (vLLM **0.23.0**, sm_121a) — validated and benchmarked on Gemma-4 as of 2026-06-18 (see the Performance section above). The earlier Gemma-4-specific multimodal/`NVFP4_SVD` issues are resolved on the 0.23.0 build; the only serving requirement for this BF16 repo is the `processor_config.json` noted above. `transformers` direct loading also works for non-Blackwell hardware.
 ## Capability Comparison vs Base

assets/perf/gemma12b_quant_family.svg ADDED
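
The chart itself is not reproduced on this page, but the single-stream figures quoted in the README diff above can be cross-checked with simple arithmetic. The sketch below assumes decode throughput is roughly the reciprocal of TPOT, and uses the approximate sibling rates from the README text (≈16 and ≈21 tok/s). The function names are illustrative.

```python
# Figures copied from the README diff above (DGX Spark GB10, vLLM 0.23.0).
TPOT_MS = {
    "Coding": 131.2,
    "Math": 131.4,
    "Reasoning": 131.1,
    "Prose": 131.5,
    "Natural language": 131.4,
    "Extraction / JSON": 128.0,
}

# Approximate single-stream decode rates quoted in the README text (tok/s).
SINGLE_STREAM_TOK_S = {
    "BF16": 7.6,
    "FP8": 16.0,
    "NVFP4": 16.0,
    "NVFP4+FP8 mixed": 21.0,
}


def decode_rate(tpot_ms):
    """Tokens per second implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def speedup_vs_bf16(variant):
    """Single-stream speedup of a quantized sibling over the BF16 baseline."""
    return SINGLE_STREAM_TOK_S[variant] / SINGLE_STREAM_TOK_S["BF16"]


for category, tpot in TPOT_MS.items():
    print(f"{category}: {decode_rate(tpot):.1f} tok/s")
for variant in SINGLE_STREAM_TOK_S:
    print(f"{variant}: {speedup_vs_bf16(variant):.1f}x")
```

The reciprocal of each TPOT value rounds to the decode rate in the table (7.6 tok/s, or 7.8 for Extraction / JSON), and the ratios round to the 2.1× and 2.8× speedups stated in the README.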

assets/perf/sib_g12bf16_concurrency.svg ADDED
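
The concurrency chart is likewise not reproduced here. As a rough reading of the aggregate figures quoted in the README diff, the sketch below divides each aggregate rate by its concurrency level to get a per-request rate. The inputs are the rounded "~" values from the README (with 450 standing in for the 446–458 range at c=64), so the outputs are approximate. The function names are illustrative.

```python
# Approximate aggregate throughput (tok/s) by concurrency, as quoted in the README.
AGGREGATE_TOK_S = {1: 8.0, 8: 70.0, 16: 140.0, 32: 260.0, 64: 450.0}


def per_request_rate(concurrency):
    """Average tokens per second each request sees at a given concurrency."""
    return AGGREGATE_TOK_S[concurrency] / concurrency


def speedup_over_single(concurrency):
    """Aggregate throughput relative to the c=1 figure."""
    return AGGREGATE_TOK_S[concurrency] / AGGREGATE_TOK_S[1]


for c in sorted(AGGREGATE_TOK_S):
    print(
        f"c={c:>2}: {per_request_rate(c):.2f} tok/s per request, "
        f"{speedup_over_single(c):.1f}x aggregate vs c=1"
    )
```

On these rounded inputs each request stays in the 7 to 9 tok/s range from c=1 through c=64, which is what "near-linear" scaling means in practice: adding requests raises total throughput without slowing individual streams much.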