# AGENTS.md: Gemma-4-12B-it AEON Abliterated K=4 (BF16)

Instructions for autonomous agents on how to **download, load, prompt, fine-tune, quantize, or evaluate** this model correctly.

## What this model is

A BF16 abliteration of `google/gemma-4-12B-it` produced via **K=4 multi-direction norm-preserving biprojection** (custom extension of TrevorJS's K=1 recipe). Refusal behavior is removed; capability is preserved (≤ ±1% wikitext PPL drift, IFEval and HumanEval syntactic match base, +6.7pp on HumanEval functional in our N=15 sample).

This is the **canonical full-precision** publish (bit-exact anchor of the K4 quant family). For Blackwell inference pick a quantized sibling instead: the **[FP8 sister](https://huggingface.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-FP8)** (13 GB, near-lossless, matches BF16) for quality, or the **[mixed NVFP4+FP8 sister](https://huggingface.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-NVFP4-FP8)** (9.3 GB, smallest + ~2.8× faster at c=1) for size/speed.

## Hard requirements

- **transformers** ≥ 5.10 (model_type is `gemma4_unified`, missing from 5.5.x)
- **torch** ≥ 2.7 with CUDA, BF16 support
- **GPU memory** ≥ 26 GB free for BF16 inference + activations + KV cache (single 24 GB card needs offloading)
- **disk** ≥ 24 GB free

## Download

```bash
pip install --upgrade "transformers>=5.10" "huggingface_hub[cli]"
huggingface-cli download \
  AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 \
  --local-dir ./Gemma-4-12B-AEON-K4-BF16
```

## Load (transformers)

```python
from transformers import AutoTokenizer, Gemma4UnifiedForConditionalGeneration
import torch

REPO = "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16"
tok = AutoTokenizer.from_pretrained(REPO)
model = Gemma4UnifiedForConditionalGeneration.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, device_map="cuda:0"
)
```

> ⚠️ Gemma-4 is multimodal: `apply_chat_template` returns a `BatchEncoding` **dict**, not a tensor. Always unpack with `**enc` (or grab `enc["input_ids"]` explicitly).

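The hard requirements listed earlier can be checked programmatically before a download is attempted. The following is a minimal pre-flight sketch, not part of the published repo: the helper names (`version_tuple`, `installed_at_least`, `free_disk_gb`, `free_gpu_gb`, `preflight`) are hypothetical, and the thresholds are copied from the requirements list. It relies on the standard library, plus `torch.cuda.mem_get_info` when torch happens to be installed, and it does not verify BF16 support.

```python
import re
import shutil
from importlib import metadata

# Thresholds copied from the "Hard requirements" list.
MIN_TRANSFORMERS = (5, 10)
MIN_TORCH = (2, 7)
MIN_DISK_GB = 24
MIN_GPU_GB = 26


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric components: '2.7.1+cu128' -> (2, 7, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:  # stop at suffixes such as 'dev0'
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_at_least(package: str, minimum: tuple) -> bool:
    """True if `package` is installed and its version is >= `minimum`."""
    try:
        return version_tuple(metadata.version(package)) >= minimum
    except metadata.PackageNotFoundError:
        return False


def free_disk_gb(path: str = ".") -> float:
    """Free space (decimal GB) on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def free_gpu_gb():
    """Free memory (decimal GB) on the current CUDA device, or None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9


def preflight(download_dir: str = ".") -> dict:
    """Map each hard requirement to True (met) or False (not met)."""
    gpu_gb = free_gpu_gb()
    return {
        "transformers>=5.10": installed_at_least("transformers", MIN_TRANSFORMERS),
        "torch>=2.7": installed_at_least("torch", MIN_TORCH),
        "disk>=24GB": free_disk_gb(download_dir) >= MIN_DISK_GB,
        "gpu>=26GB": gpu_gb is not None and gpu_gb >= MIN_GPU_GB,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'ok  ' if ok else 'FAIL'} {name}")
```

A failed `gpu>=26GB` check does not rule out loading the model; per the requirements list, a 24 GB card can still work with offloading.
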
## Inference (one-shot)

```python
enc = tok.apply_chat_template(
    [{"role": "user", "content": "Explain quantum error correction in 3 paragraphs."}],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
enc = {k: v.to(model.device) for k, v in enc.items() if hasattr(v, "to")}
out = model.generate(
    **enc, max_new_tokens=512, do_sample=False, pad_token_id=tok.eos_token_id
)
print(tok.decode(out[0][enc["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Serving (vLLM)

Use the unified AEON container on DGX Spark / Blackwell. BF16 runs **plain decode** (no speculative drafter; MTP is net-neutral on GB10) and needs a `processor_config.json` for vLLM's multimodal load path; fetch it from the base model. This recipe matches the 🚀 Quickstart in the README.

```bash
# 1) Pull the unified AEON vLLM container (vLLM 0.23.0, sm_121a)
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2) Pull this model fresh
huggingface-cli download AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 --local-dir ./aeon-model

# 2b) BF16 multimodal path needs a processor_config.json; fetch from the base model
huggingface-cli download google/gemma-4-12B-it processor_config.json --local-dir ./aeon-model

# 3) (no drafter: BF16 runs plain decode)

# 4) Serve (OpenAI-compatible API on :8000)
docker run -d --name aeon-gemma12b --gpus all --ipc=host --shm-size=16g --net=host \
  -v ./aeon-model:/model:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
  --attention-backend triton_attn \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code
```

For higher concurrency add `--kv-cache-dtype fp8_e4m3 --max-model-len 8192 --max-num-seqs 16`. Keep `--gpu-memory-utilization` **≤ 0.88** on the Spark (unified LPDDR5X is shared across CPU+GPU; 0.90+ thrashes).

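Once the container is serving, the OpenAI-compatible endpoint on :8000 can be queried from any HTTP client. The sketch below uses only the Python standard library. It assumes the server is reachable at `localhost:8000` and that no `--served-model-name` was passed, in which case vLLM addresses the model by the path given to `serve` (`/model`). The helper names `build_chat_request`, `extract_text`, and `chat` are illustrative, and the sampling defaults follow the chat settings recommended under Prompt expectations.

```python
import json
import urllib.request

# Assumptions: server started with the docker command above (host networking,
# port 8000) and no --served-model-name, so the model is addressed as "/model".
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "/model"


def build_chat_request(prompt, max_tokens=512, temperature=0.7, top_p=0.9):
    """Build an OpenAI-style chat-completions payload for a single user turn."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def extract_text(response):
    """Return the assistant text of the first choice in a parsed response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, timeout=120.0, **kwargs):
    """POST the payload to /chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_text(json.load(reply))
```

With the server running, `print(chat("Explain quantum error correction in 3 paragraphs."))` prints the reply. For deterministic evaluation, pass `temperature=0.0`.
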
### Container (v0.23.0 truth)

- **Image**: `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` = `:2026-06-18-v0.23.0-dflashfix`; rollback `:2026-06-11-pr41703`. The old lineage (`omni-q36`, `vllm-spark-*`, `aeon-gemma-4-26b-a4b-dflash`, `vllm-aeon-ultimate-*`, `vllm-dflash`) is consolidated into this unified image; those tags are **historical only**. The earlier "fall back to the 0.20.1 image, multimodal is broken" note is **obsolete**: Gemma-4 multimodal + `NVFP4_SVD` issues are resolved on the 0.23.0 build, validated 2026-06-18.
- **vLLM**: 0.23.0 built from source for **sm_121a** (not 0.22.1 / 0.20.1). **FlashInfer 0.6.12** (not 0.6.8.post1).
- **Runtime patches (TWO remain)**: `cuda_optional_import` + `cudagraph_align`. The old `kv_cache_utils` patch was **DROPPED** in 0.23.0 (`block_size` is now an int upstream). **PLUS** the new **DFlash high-concurrency block-table fix** (port of upstream PR #43982, which fixed this for MTP but not DFlash): it slices the drafter KV block-table to the unpadded batch (`block_table[:num_reqs]`) and clears the ≥32-concurrency crash. (This BF16 repo runs plain decode so it isn't drafter-bound, but the fix ships in the same image.)
- **Carried open-upstream PRs**: #44389 (Triton NVFP4 KV cache), #40898 (DFlash SWA), #41703 (Gemma-4 DFlash prefix-cache-safe).

> A direct `vllm serve AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 …` (HF repo id, no docker) also works on non-Blackwell hardware via `transformers` loading; on the Spark prefer the container above.

## Prompt expectations

- **Style**: Inherits the Gemma 4 instruct voice: markdown-rich, detail-oriented, and reliable in English only.
- **Refusal preamble**: On topics the base would decline, expect a 1–3 sentence disclaimer / context paragraph **before** the requested content. The model **complies** after the preamble. This is a stylistic artifact of single-pass biprojection at scale=1.0 and is not reducible without DPO/SFT.
- **System prompts**: Persona instructions are obeyed; explicit safety-disabling system prompts are not necessary for the model to produce content the base would decline.
- **Sampling**: `do_sample=False` for deterministic eval. For chat, `temperature=0.7, top_p=0.9` matches the base model's behavior closely.

## Behavior notes for agents

- **Tool calling** works via the standard Gemma-4 tool-call parser (`--tool-call-parser gemma4` in vLLM). The model produces tool-call JSON without hesitation.
- **Multimodal**: vision/audio encoder paths are unchanged from base: `embed_vision*` and `embed_audio*` are untouched by the abliteration. Image and audio inputs work normally.
- **Long context**: Hybrid sliding-1024 + full-attention layers. On sliding-only layers the KV cache grows linearly up to 1024 tokens and then stays flat; full-attention layers keep growing with context. 65k context is comfortable on 80 GB GPUs.
- **Languages**: English is the validated path. The base model's multilingual capability is preserved structurally, but the abliteration calibration probes were English-only.

## Fine-tuning

This model is a **legitimate base for further training**. The weight edits are sparse (48 matrices out of ~600) and norm-preserving, so LoRA, QLoRA, full fine-tune, and DPO all work without surprises. Use the same hyperparameters as you would for `google/gemma-4-12B-it`.

```python
# LoRA target modules (standard for Gemma-4):
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```

## Re-quantization

To reproduce the published NVFP4 SVDQuant variant:

```bash
git clone https://github.com/AEON-7/aeon-abliterations
cd aeon-abliterations/gemma-4-12B
python3 quantize_svdquant.py \
  --src AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 \
  --dst ./Gemma-4-12B-AEON-K4-NVFP4-SVDQuant \
  --calibration-dataset abisee/cnn_dailymail \
  --calibration-samples 2048 \
  --calibration-tokens 1024 \
  --exclude 'lm_head' --exclude 'model.embed_vision*' --exclude 'model.embed_audio*' \
  --device cuda:0
```

Quant takes ~7.5h on DGX Spark GB10. Native sm_121a calibration gives hardware-accurate scale factors.

## Eval suggestions

| Eval | Expected | Command |
|---|---|---|
| Refusal rate (heretic) | 0% | `python -m heretic.evaluator --model ` |
| wikitext PPL drift | -4.22% | `python scripts/eval_ppl.py --base google/gemma-4-12B-it --target ` |
| IFEval (verifiable subset) | 90% | `lm-eval --task ifeval --model hf --model_args pretrained=` |
| MMLU (full) | within ±2pp of base | `lm-eval --task mmlu --model hf --model_args pretrained=` |
| HumanEval functional | +6.7pp vs base (N=15 sample) | `bigcode-eval --task humaneval --model ` |

## License + safety

- Inherits the [Gemma license](https://ai.google.dev/gemma/terms).
- Operator-side safety layers are **required** for any production use; see the arbitration clause in the README.
- The absence of refusal behavior means **the duty of care is yours, not the model's**.

## Support the work

Tip-jar addresses in the README. No obligation.