# AGENTS.md — Gemma-4-12B-it AEON Abliterated K=4 (BF16)

Instructions for autonomous agents on how to **download, load, prompt,
fine-tune, quantize, or evaluate** this model correctly.

## What this model is

A BF16 abliteration of `google/gemma-4-12B-it` produced via
**K=4 multi-direction norm-preserving biprojection** (a custom extension
of TrevorJS's K=1 recipe). Refusal behavior is removed; capability is
preserved (≤ ±1% wikitext PPL drift; IFEval and HumanEval syntactic scores
match base; +6.7pp on HumanEval functional in our N=15 sample).

This is the **canonical full-precision** release (the bit-exact anchor of the
K4 quant family). For Blackwell inference, pick a quantized sibling instead:
the **[FP8 sister](https://huggingface.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-FP8)**
(13 GB, near-lossless, matches BF16) for quality, or the
**[mixed NVFP4+FP8 sister](https://huggingface.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-NVFP4-FP8)**
(9.3 GB, the smallest, and ~2.8× faster at concurrency 1) for size and speed.

## Hard requirements

- **transformers** ≥ 5.10 (model_type is `gemma4_unified`, missing from 5.5.x)
- **torch** ≥ 2.7 with CUDA, BF16 support
- **GPU memory** ≥ 26 GB free for BF16 inference + activations + KV cache (single 24 GB card needs offloading)
- **disk** ≥ 24 GB free
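
The version and disk requirements above can be checked before downloading. The following is a minimal preflight sketch, not part of this repo: the helper names (`parse_version`, `meets_minimum`, `preflight`) are hypothetical, and the GPU-memory check is left out so the script runs without `torch` installed.

```python
import shutil
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version string is at least the minimum tuple."""
    return parse_version(installed) >= tuple(minimum)


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path="."):
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    for package, minimum in (("transformers", (5, 10)), ("torch", (2, 7))):
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if not meets_minimum(installed, minimum):
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{package} {installed} is older than {wanted}")
    if free_disk_gb(path) < 24:
        problems.append("less than 24 GB of free disk space")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

An agent could call `preflight()` first and stop if the returned list is not empty.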

## Download

```bash
pip install --upgrade "transformers>=5.10" "huggingface_hub[cli]"
huggingface-cli download \
  AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 \
  --local-dir ./Gemma-4-12B-AEON-K4-BF16
```

## Load (transformers)

```python
from transformers import AutoTokenizer, Gemma4UnifiedForConditionalGeneration
import torch

REPO = "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16"
tok = AutoTokenizer.from_pretrained(REPO)
model = Gemma4UnifiedForConditionalGeneration.from_pretrained(
    REPO, dtype=torch.bfloat16, device_map="cuda:0"
)
```

> ⚠️ Gemma-4 is multimodal: with `return_dict=True`, `apply_chat_template` returns a `BatchEncoding` (dict-like), not a bare tensor. Always unpack with `**enc` (or grab `enc["input_ids"]` explicitly).

## Inference (one-shot)

```python
enc = tok.apply_chat_template(
    [{"role": "user", "content": "Explain quantum error correction in 3 paragraphs."}],
    add_generation_prompt=True, return_tensors="pt", return_dict=True,
)
enc = {k: v.to(model.device) for k, v in enc.items() if hasattr(v, "to")}
out = model.generate(
    **enc, max_new_tokens=512, do_sample=False, pad_token_id=tok.eos_token_id
)
print(tok.decode(out[0][enc["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Serving (vLLM)

Use the unified AEON container on DGX Spark / Blackwell. BF16 runs **plain decode** (no speculative drafter — MTP is net-neutral on GB10) and needs a `processor_config.json` for vLLM's multimodal load path; fetch it from the base model. This recipe matches the 🚀 Quickstart in the README.

```bash
# 1) Pull the unified AEON vLLM container (vLLM 0.23.0, sm_121a)
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2) Pull this model fresh
huggingface-cli download AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 --local-dir ./aeon-model

# 2b) BF16 multimodal path needs a processor_config.json — fetch from the base model
huggingface-cli download google/gemma-4-12B-it processor_config.json --local-dir ./aeon-model

# 3) (no drafter — BF16 runs plain decode)

# 4) Serve (OpenAI-compatible API on :8000)
docker run -d --name aeon-gemma12b --gpus all --ipc=host --shm-size=16g --net=host \
  -v "$(pwd)/aeon-model":/model:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
    --attention-backend triton_attn \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --enable-auto-tool-choice \
    --gpu-memory-utilization 0.85 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --trust-remote-code
```

For higher concurrency add `--kv-cache-dtype fp8_e4m3 --max-model-len 8192 --max-num-seqs 16`. Keep `--gpu-memory-utilization` **≤ 0.88** on the Spark (unified LPDDR5X is shared across CPU+GPU; 0.90+ thrashes).
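
Once the container is up, the OpenAI-compatible endpoint can be exercised with the standard library alone. This is a hedged sketch: it assumes the server is reachable at `localhost:8000` and that the served model name is `/model` (vLLM defaults to the path passed to `serve` when no served name is set; `GET /v1/models` shows the actual value). The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: server runs on this host
MODEL_NAME = "/model"                  # assumption: default served name = serve path


def build_chat_request(prompt, max_tokens=256, temperature=0.7, top_p=0.9):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send one user message and return the assistant's text content."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Say hello in one sentence.")` makes a useful smoke test after startup; it raises `urllib.error.URLError` while the server is still loading weights.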

### Container (v0.23.0 truth)

- **Image**: `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` = `:2026-06-18-v0.23.0-dflashfix`; rollback `:2026-06-11-pr41703`. The old lineage (`omni-q36`, `vllm-spark-*`, `aeon-gemma-4-26b-a4b-dflash`, `vllm-aeon-ultimate-*`, `vllm-dflash`) is consolidated into this unified image — those tags are **historical only**. The earlier "fall back to the 0.20.1 image, multimodal is broken" note is **obsolete**: Gemma-4 multimodal + `NVFP4_SVD` issues are resolved on the 0.23.0 build, validated 2026-06-18.
- **vLLM**: 0.23.0 built from source for **sm_121a** (not 0.22.1 / 0.20.1). **FlashInfer 0.6.12** (not 0.6.8.post1).
- **Runtime patches — TWO remain**: `cuda_optional_import` + `cudagraph_align`. The old `kv_cache_utils` patch was **DROPPED** in 0.23.0 (`block_size` is now an int upstream). **PLUS** the new **DFlash high-concurrency block-table fix** (port of upstream PR #43982, which fixed this for MTP but not DFlash) — it slices the drafter KV block-table to the unpadded batch (`block_table[:num_reqs]`) and clears the ≥32-concurrency crash. (This BF16 repo runs plain decode so it isn't drafter-bound, but the fix ships in the same image.)
- **Carried open-upstream PRs**: #44389 (Triton NVFP4 KV cache), #40898 (DFlash SWA), #41703 (Gemma-4 DFlash prefix-cache-safe).

> A direct `vllm serve AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 …` (HF repo id, no docker) also works on non-Blackwell hardware via `transformers` loading; on the Spark prefer the container above.

## Prompt expectations

- **Style**: Inherits the Gemma 4 instruct voice: markdown-rich and detail-oriented. Reliable only in English.
- **Refusal preamble**: On topics the base would decline, expect a 1–3 sentence disclaimer / context paragraph **before** the requested content. The model **complies** after the preamble. This is a stylistic artifact of single-pass biprojection at scale=1.0 and is not reducible without DPO/SFT.
- **System prompts**: Persona instructions are obeyed; explicit safety-disabling system prompts are not necessary for the model to produce content the base would decline.
- **Sampling**: `do_sample=False` for deterministic eval. For chat, `temperature=0.7, top_p=0.9` matches the base model's behavior closely.
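
The two sampling modes above can be kept as presets so that eval and chat runs never mix settings. A small sketch; the helper name `generation_kwargs` is illustrative, and the values are the ones listed in the bullet above.

```python
def generation_kwargs(mode, max_new_tokens=512):
    """Return keyword arguments for `model.generate` for a given mode."""
    if mode == "eval":
        # Greedy decoding: deterministic, suitable for benchmarks.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    if mode == "chat":
        # Sampling settings suggested above for chat use.
        return {
            "do_sample": True,
            "temperature": 0.7,
            "top_p": 0.9,
            "max_new_tokens": max_new_tokens,
        }
    raise ValueError(f"unknown mode: {mode!r} (expected 'eval' or 'chat')")
```

With the objects from the inference example, usage would look like `model.generate(**enc, **generation_kwargs("chat"), pad_token_id=tok.eos_token_id)`.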

## Behavior notes for agents

- **Tool calling** works via the standard Gemma-4 tool-call parser (`--tool-call-parser gemma4` in vLLM). The model produces tool-call JSON without hesitation.
- **Multimodal**: vision/audio encoder paths are unchanged from base — `embed_vision*` and `embed_audio*` are untouched by the abliteration. Image and audio inputs work normally.
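- **Long context**: Hybrid sliding-window (1024 tokens) and full-attention layers. The KV cache of the sliding-window layers grows linearly up to 1024 tokens and then stays flat; the full-attention layers keep growing linearly with context length. 65k context is comfortable on 80 GB GPUs.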
- **Languages**: English is the validated path. The base model's multilingual capability is preserved structurally but the abliteration calibration probes were English-only.
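
The long-context behavior can be made concrete with a back-of-the-envelope KV-cache estimate. This is a sketch under stated assumptions: the layer counts, KV-head count, and head dimension in the example call are placeholders, not values from this model (read the real ones from `config.json`), and real serving stacks may allocate memory differently (paged blocks, FP8 KV cache).

```python
def kv_cache_bytes(seq_len, n_sliding_layers, n_full_layers,
                   n_kv_heads, head_dim, window=1024, bytes_per_elem=2):
    """Estimate KV-cache size for one sequence in a hybrid-attention model.

    Sliding-window layers keep at most `window` tokens; full-attention
    layers keep every token. `bytes_per_elem=2` corresponds to BF16.
    """
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    sliding_tokens = min(seq_len, window)
    cached_tokens = n_sliding_layers * sliding_tokens + n_full_layers * seq_len
    return per_token_per_layer * cached_tokens


if __name__ == "__main__":
    # Placeholder architecture numbers, for illustration only.
    for seq_len in (512, 1024, 8192, 65536):
        size = kv_cache_bytes(seq_len, n_sliding_layers=40, n_full_layers=8,
                              n_kv_heads=8, head_dim=256)
        print(f"{seq_len:>6} tokens: {size / 1e9:.2f} GB")
```

Past the window, only the full-attention term grows with context length.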

## Fine-tuning

This model is a **legitimate base for further training**. The weight edits are sparse (48 matrices out of ~600) and norm-preserving, so LoRA, QLoRA, full fine-tune, and DPO all work without surprises. Use the same hyperparameters as you would for `google/gemma-4-12B-it`.

```python
# LoRA target modules (standard for Gemma-4):
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```
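
A PEFT configuration built around those target modules might look like the sketch below. The rank, alpha, and dropout values are illustrative assumptions, not tuned recommendations from this repo. `peft` is imported lazily so the snippet loads even where it is not installed.

```python
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]

# Illustrative hyperparameters (assumptions, not values published with this model).
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": TARGET_MODULES,
    "task_type": "CAUSAL_LM",
}


def build_lora_config():
    """Create a peft.LoraConfig from the settings above."""
    from peft import LoraConfig  # lazy import: requires `pip install peft`
    return LoraConfig(**LORA_KWARGS)
```

PEFT matches a list of `target_modules` by module-name suffix, so if the vision or audio towers contain modules with the same names they would receive adapters too. Inspect `model.named_modules()` before training if only the language model should be adapted.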

## Re-quantization

To reproduce the published NVFP4 SVDQuant variant:

```bash
git clone https://github.com/AEON-7/aeon-abliterations
cd aeon-abliterations/gemma-4-12B
python3 quantize_svdquant.py \
  --src AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 \
  --dst ./Gemma-4-12B-AEON-K4-NVFP4-SVDQuant \
  --calibration-dataset abisee/cnn_dailymail \
  --calibration-samples 2048 \
  --calibration-tokens 1024 \
  --exclude 'lm_head' --exclude 'model.embed_vision*' --exclude 'model.embed_audio*' \
  --device cuda:0
```

Quantization takes ~7.5 h on a DGX Spark (GB10). Calibrating natively on sm_121a gives hardware-accurate scale factors.
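
To preview which modules the `--exclude` patterns would keep unquantized, the matching can be approximated with glob rules. This is only a sketch: it assumes `quantize_svdquant.py` matches patterns against full module names with shell-style globs, which has not been verified against the script, and the module names in the example are hypothetical.

```python
from fnmatch import fnmatchcase

EXCLUDE_PATTERNS = ["lm_head", "model.embed_vision*", "model.embed_audio*"]


def is_excluded(module_name, patterns=EXCLUDE_PATTERNS):
    """True if the full module name matches any glob-style exclude pattern."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    for name in (
        "lm_head",
        "model.embed_vision.projection",
        "model.language_model.layers.0.mlp.up_proj",
    ):
        status = "kept unquantized" if is_excluded(name) else "quantized"
        print(f"{name}: {status}")
```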

## Eval suggestions

| Eval | Expected | Command |
|---|---|---|
| Refusal rate (heretic) | 0% | `python -m heretic.evaluator --model <repo>` |
| wikitext PPL drift | -4.22% | `python scripts/eval_ppl.py --base google/gemma-4-12B-it --target <repo>` |
| IFEval (verifiable subset) | 90% | `lm-eval --tasks ifeval --model hf --model_args pretrained=<repo>` |
| MMLU (full) | within ±2pp of base | `lm-eval --tasks mmlu --model hf --model_args pretrained=<repo>` |
| HumanEval functional | +6.7pp vs base (N=15 sample) | `bigcode-eval --task humaneval --model <repo>` |
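
Two of the figures in the table are simple arithmetic. The sketch below assumes drift is measured relative to the base model, so a negative value means the target has lower perplexity; the perplexities in the example are made-up placeholders, not measurements.

```python
def ppl_drift_percent(base_ppl, target_ppl):
    """Relative perplexity change of the target versus the base, in percent."""
    return (target_ppl - base_ppl) / base_ppl * 100.0


def percentage_points_per_item(n_items):
    """How many percentage points a single problem is worth in an N-item sample."""
    return 100.0 / n_items


if __name__ == "__main__":
    # Placeholder perplexities, for illustration only.
    print(f"drift: {ppl_drift_percent(20.0, 19.0):+.2f}%")
    print(f"one item at N=15: {percentage_points_per_item(15):.1f}pp")
```

At N=15 a single problem is worth about 6.7pp, so the HumanEval functional delta is consistent with a one-problem difference and should be read with that sample size in mind.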

## License + safety

- Inherits the [Gemma license](https://ai.google.dev/gemma/terms).
- Operator-side safety layers are **required** for any production use — see the arbitration clause in the README.
- The absence of refusal behavior means **the duty of care is yours, not the model's**.

## Support the work

Tip-jar addresses in the README. No obligation.