docs(perf): v0.23.0 benchmarks on aeon-vllm-ultimate:latest + charts
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- README.md +45 -5
- assets/perf/gemma12b_quant_family.svg +39 -0
- assets/perf/sib_g12bf16_concurrency.svg +63 -0
README.md
CHANGED
@@ -68,9 +68,11 @@ tags:
# Gemma-4-12B-it AEON Abliterated — K=4 Multi-Direction Biprojection (BF16)

- > ## ✅ Verified working in native vLLM (2026-06-
>
- > Smoke-tested end-to-end via vLLM's native `Gemma4UnifiedForConditionalGeneration` loader
>
> ### Single-stream by category (greedy, no spec decode, max_tokens=250)
>
@@ -120,6 +122,46 @@ tags:
> ```
## Available formats / quant grid
| Variant | Repo | Precision | Size | Pick when |
@@ -133,9 +175,7 @@ tags:
### Recommended runtime

-
-
- > ⚠️ The newer **[AEON vLLM Ultimate](https://github.com/AEON-7/vllm-ultimate-dgx-spark)** container (`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`) bundles vLLM 0.22.1 + PR #44389 (NVFP4 KV cache, 3× capacity) + DFlash + TurboQuant for our **Qwen3.6 family**, but as of 2026-06-04 it has **upstream PR #44389 bugs on Gemma-4**: the multimodal fallback hits a shape mismatch in `Gemma4UnifiedForConditionalGeneration`, and the modelopt path doesn't yet recognize `NVFP4_SVD`. A patched tag will be published when upstream merges fixes. Track the container's [Known issues](https://github.com/AEON-7/vllm-ultimate-dgx-spark#known-issues).

## Capability Comparison vs Base

@@ -68,9 +68,11 @@ tags:
# Gemma-4-12B-it AEON Abliterated — K=4 Multi-Direction Biprojection (BF16)

+ > ## ✅ Verified working in native vLLM (re-validated v0.23.0, 2026-06-18)
>
+ > Smoke-tested and benchmarked end-to-end via vLLM's native `Gemma4UnifiedForConditionalGeneration` loader on the unified **`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`** image (vLLM **0.23.0**, sm_121a) on DGX Spark GB10. The loader for Google's encoder-free Gemma-4-12B was added upstream in `gemma4_unified.py` (after the initial PR #38826 merge). Our previous "blocked on architecture mismatch" notes are obsolete.
+ >
+ > **Serving note:** this BF16 repo needs a `processor_config.json` to load on vLLM's multimodal path. For the v0.23.0 benchmark we copied it from the `…-K4-FP8` sibling. **We recommend adding `processor_config.json` to this repo** (or downloading it, see below) so it serves out-of-the-box.
>
> ### Single-stream by category (greedy, no spec decode, max_tokens=250)
>

@@ -120,6 +122,46 @@ tags:
> ```

+ ## Performance: DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)
+
+ Measured on a single **DGX Spark GB10** (Blackwell sm_121a, unified LPDDR5X) with `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (vLLM **0.23.0**). This is a **dense 12B** run with **plain decode (no speculative drafter)**: MTP / speculative drafting is net-neutral on GB10 (the per-step drafter + 262k-vocab `lm_head` cost cancels the acceptance win), so the BF16 baseline is run straight. The two results to read here are the **quant-vs-speed tradeoff** (this unquantized BF16 is the slow-but-exact anchor) and **concurrency throughput** (it scales cleanly to c=64).
+
+ <p align="center"><img src="assets/perf/gemma12b_quant_family.svg" width="100%" alt="Gemma-4-12B K4 quant family: single-stream and c=64 aggregate throughput across BF16, FP8, NVFP4, and NVFP4-FP8 mixed"></p>
+
+ **Quant-vs-speed:** unquantized BF16 is the baseline at **~7.6 tok/s** single-stream. The quantized siblings are **2.1–2.8× faster** at c=1 (FP8 and NVFP4 ≈16 tok/s, 2.1×; the NVFP4+FP8 mixed ≈21 tok/s, 2.8×) for the same weights at a fraction of the memory. Pick BF16 only when you need bit-exact weights, fine-tuning, or non-Blackwell hardware; otherwise the FP8 sibling is near-lossless and the mixed NVFP4+FP8 is the smallest and fastest.
+
+ ### Single-stream by category (c=1, greedy, plain decode)
+
+ | Category | Decode (tok/s) | TTFT (ms) | TPOT (ms) | Prefill (tok/s) | DFlash accept |
+ |---|---:|---:|---:|---:|---:|
+ | Coding | 7.6 | 264 | 131.2 | 190 | n/a (plain decode) |
+ | Math | 7.6 | 268 | 131.4 | 247 | n/a |
+ | Reasoning | 7.6 | 264 | 131.1 | 197 | n/a |
+ | Prose | 7.6 | 264 | 131.5 | 152 | n/a |
+ | Natural language | 7.6 | 267 | 131.4 | 165 | n/a |
+ | Extraction / JSON | 7.8 | 264 | 128.0 | 216 | n/a |
+
+ Single-stream decode is **memory-bandwidth-bound** and flat across categories (~7.6 tok/s), as expected for a dense 24 GB BF16 model on the GB10's unified pool. There is **no DFlash/MTP acceptance** to report: this baseline runs plain decode by design (no drafter attached).
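
The decode column in the table above can be cross-checked against TPOT, since steady-state decode rate is roughly `1000 / TPOT_ms`. The sketch below does that arithmetic and adds a rough per-request latency estimate. The latency model (TTFT for the first token, then one TPOT per remaining token) is a simplifying assumption of this example, not something the benchmark harness reports, and the function names are illustrative.

```python
# TTFT and TPOT (milliseconds) copied from the single-stream table above.
ROWS = {
    "Coding": (264, 131.2),
    "Math": (268, 131.4),
    "Reasoning": (264, 131.1),
    "Prose": (264, 131.5),
    "Natural language": (267, 131.4),
    "Extraction / JSON": (264, 128.0),
}


def decode_tok_per_s(tpot_ms: float) -> float:
    """Steady-state decode rate implied by the time per output token."""
    return 1000.0 / tpot_ms


def request_latency_s(ttft_ms: float, tpot_ms: float, n_tokens: int) -> float:
    """Assumed model: first token arrives after TTFT, each later token costs one TPOT."""
    return (ttft_ms + (n_tokens - 1) * tpot_ms) / 1000.0


for name, (ttft, tpot) in ROWS.items():
    rate = decode_tok_per_s(tpot)
    latency = request_latency_s(ttft, tpot, 250)  # max_tokens=250, as in the benchmark
    print(f"{name:18s} {rate:4.1f} tok/s  ~{latency:5.1f} s for 250 tokens")
```

Under this model a full 250-token response takes a little over half a minute at c=1, which is the practical meaning of the flat ~7.6 tok/s figure.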
+
+ ### Aggregate throughput by concurrency
+
+ <p align="center"><img src="assets/perf/sib_g12bf16_concurrency.svg" width="100%" alt="Gemma-4-12B K4-BF16 aggregate throughput vs concurrency (c=1 to c=64) on aeon-vllm-ultimate:latest"></p>
+
+ Where this BF16 build earns its keep is **batched serving**: aggregate throughput scales near-linearly with concurrency, peaking at **≈446–458 tok/s at c=64** (≈8 tok/s at c=1 → ~70 at c=8 → ~140 at c=16 → ~260 at c=32 → **446–458 at c=64**). The DFlash high-concurrency fix in this image (below) lets it reach c=64 without the prior crash. Long-context draft-acceptance is not applicable here (plain decode).
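
To make "near-linearly" concrete, the sketch below turns the approximate figures quoted in the paragraph above into a speedup over c=1 and a per-stream rate. The inputs are the rounded numbers from the text, and the c=64 value is taken as the midpoint of the quoted 446–458 range; both choices are assumptions of this example, so the results are only indicative.

```python
# Approximate aggregate throughput (tok/s) by concurrency, as quoted in the text.
# 452 is the midpoint of the 446-458 range reported for c=64 (an assumption here).
AGGREGATE_TOK_S = {1: 8, 8: 70, 16: 140, 32: 260, 64: 452}


def speedup(concurrency: int) -> float:
    """Aggregate throughput relative to the single-stream figure."""
    return AGGREGATE_TOK_S[concurrency] / AGGREGATE_TOK_S[1]


def per_stream(concurrency: int) -> float:
    """Average decode rate each concurrent request sees."""
    return AGGREGATE_TOK_S[concurrency] / concurrency


for c in AGGREGATE_TOK_S:
    print(f"c={c:<3d} aggregate={AGGREGATE_TOK_S[c]:>4d} tok/s  "
          f"speedup={speedup(c):5.2f}x  per-stream={per_stream(c):4.2f} tok/s")
```

With these inputs the per-stream rate stays close to the single-stream figure through c=32 and dips at c=64, which is what near-linear aggregate scaling looks like from a single client's point of view.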
+
+ > No fresh stock-vanilla baseline exists yet for this repo; a same-harness vanilla-vLLM re-bench is pending, and the figures above are all on the optimized `aeon-vllm-ultimate:latest`.
+
+ ## What we fixed for the DGX Spark
+
+ All AEON models run on one unified container, **`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`** (= `:2026-06-18-v0.23.0-dflashfix`; rollback `:2026-06-11-pr41703`): **vLLM 0.23.0 built from source for sm_121a** and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture.
+
+ - **DFlash high-concurrency fix** *(new)*: slices the speculative drafter's KV block-table to the unpadded batch (`block_table[:num_reqs]`). The drafter previously **crashed at ≥32 concurrent requests** (padded-vs-unpadded shape mismatch in FlashAttention); it now scales cleanly to **c=64**. This is a port of upstream PR #43982 (which fixed this for MTP but never for DFlash). This BF16 baseline runs plain decode so it isn't drafter-bound, but the fix is what lets the whole family hit c=64.
+ - **sm_121a-native build**: `TORCH_CUDA_ARCH_LIST=12.1a`, compiling the SM120-family CUTLASS NVFP4/FP8 kernels GB10 actually dispatches to (the source of the quantized siblings' speed advantage).
+ - **Triton NVFP4 KV cache** (PR #44389): the only 4-bit KV path on sm_121a, giving ~3× KV capacity per GB of unified memory.
+ - **Unified-memory tuning**: conservative KV headroom (GB10 shares one LPDDR5X pool across CPU+GPU) with FULL CUDA graphs + async scheduling.
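
The first bullet describes slicing a padded block table down to the live requests. The toy below illustrates that idea with plain Python lists. It is not vLLM's or the container's code: the helper names, the `-1` padding sentinel, and the list-of-lists layout are all hypothetical, chosen only to show why a table padded to a fixed batch size disagrees with per-request metadata until it is sliced with `[:num_reqs]`.

```python
from typing import List

PAD_BLOCK = -1  # hypothetical sentinel for unused block slots


def build_padded_block_table(
    per_request_blocks: List[List[int]], padded_batch: int, max_blocks: int
) -> List[List[int]]:
    """Pad a ragged per-request block list to a fixed [padded_batch, max_blocks] table."""
    table = [
        blocks + [PAD_BLOCK] * (max_blocks - len(blocks))
        for blocks in per_request_blocks
    ]
    # Add all-padding rows until the table reaches the fixed batch size.
    while len(table) < padded_batch:
        table.append([PAD_BLOCK] * max_blocks)
    return table


def slice_to_live_requests(block_table: List[List[int]], num_reqs: int) -> List[List[int]]:
    """Keep only the rows that belong to real requests, i.e. block_table[:num_reqs]."""
    return block_table[:num_reqs]


live_blocks = [[0, 1], [2], [3, 4, 5]]  # KV blocks for three real requests
seq_lens = [20, 7, 40]                  # per-request metadata: one entry per real request

padded = build_padded_block_table(live_blocks, padded_batch=4, max_blocks=3)
sliced = slice_to_live_requests(padded, num_reqs=len(seq_lens))

print("padded rows:", len(padded), "metadata rows:", len(seq_lens))
print("sliced rows:", len(sliced), "metadata rows:", len(seq_lens))
```

In the padded table the row count no longer matches the number of sequence lengths, which is the kind of padded-vs-unpadded disagreement the bullet attributes the crash to; after slicing, the two line up again.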
+
+ > **Stock baseline:** the optimized figures above are on `aeon-vllm-ultimate:latest` (vLLM 0.23.0). A fresh fully-vanilla stock baseline (default vLLM, no AEON / sm_121a opts) is **pending a re-bench** for this repo and will be added when complete.
+
## Available formats / quant grid
| Variant | Repo | Precision | Size | Pick when |

@@ -133,9 +175,7 @@ tags:
### Recommended runtime

+ Use the unified **[AEON vLLM Ultimate](https://github.com/AEON-7/vllm-ultimate-dgx-spark)** container [`ghcr.io/aeon-7/aeon-vllm-ultimate:latest`](https://github.com/AEON-7/vllm-ultimate-dgx-spark) (vLLM **0.23.0**, sm_121a), validated and benchmarked on Gemma-4 as of 2026-06-18 (see the Performance section above). The earlier Gemma-4-specific multimodal/`NVFP4_SVD` issues are resolved on the 0.23.0 build; the only serving requirement for this BF16 repo is the `processor_config.json` noted above. Direct loading with `transformers` also works for non-Blackwell hardware.
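
The `processor_config.json` requirement can be met by copying the file from a local snapshot of the FP8 sibling, as the serving note describes. The helper below is a minimal sketch of that copy step under the assumption that both model directories already exist on disk. The function name and the directory layout are illustrative, and the demo uses throwaway temporary directories instead of real model paths.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_processor_config(model_dir, sibling_dir, filename="processor_config.json"):
    """Copy `filename` from sibling_dir into model_dir if it is missing.

    Returns True when a copy was made, False when the file was already present.
    """
    model_dir, sibling_dir = Path(model_dir), Path(sibling_dir)
    target = model_dir / filename
    if target.exists():
        return False
    source = sibling_dir / filename
    if not source.is_file():
        raise FileNotFoundError(f"{filename} not found in {sibling_dir}")
    shutil.copy2(source, target)
    return True


# Demo: temporary directories stand in for the BF16 and FP8 local snapshots.
with tempfile.TemporaryDirectory() as tmp:
    bf16_dir = Path(tmp) / "bf16"
    fp8_dir = Path(tmp) / "fp8"
    bf16_dir.mkdir()
    fp8_dir.mkdir()
    (fp8_dir / "processor_config.json").write_text("{}")

    copied_first = ensure_processor_config(bf16_dir, fp8_dir)
    copied_again = ensure_processor_config(bf16_dir, fp8_dir)
    print(copied_first, copied_again)
```

Making the helper a no-op when the file already exists means it stays safe to run before every server start, including after the file is eventually added to the repo itself.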
## Capability Comparison vs Base
assets/perf/gemma12b_quant_family.svg
ADDED
assets/perf/sib_g12bf16_concurrency.svg
ADDED