Libraries

How to use AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

# The messages below mix image and text content, so use the image-text-to-text task
pipe = pipeline("image-text-to-text", model="AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4

SGLang

How to use AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4 with Docker Model Runner:
```
docker model run hf.co/AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4
```

Gemma 4 E4B DECKARD HERETIC Uncensored NVFP4

EAGLE speculative decoding drafter for Gemma 4 31B DECKARD HERETIC Uncensored NVFP4.

A 42-layer E4B (EAGLE for Blackwell) model quantized to NVFP4 AWQ using NVIDIA ModelOpt 0.42.0. Designed for EAGLE-based speculative decoding on NVIDIA DGX Spark (GB10, SM 12.1) and other Blackwell GPUs.

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 (E4B EAGLE Drafter) |
| Target Model | AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 |
| Layers | 42 (35 sliding-window + 7 full-attention) |
| Hidden Size | 2560 |
| Attention Heads | 8 (2 KV heads), head_dim=256, global_head_dim=512 |
| Sliding Window | 512 tokens |
| Max Context | 131,072 tokens |
| Quantization | NVFP4 AWQ (ModelOpt 0.42.0) |
| Model Size | 9.6 GB |
| Vocabulary | 262,144 tokens |

Performance (DGX Spark)

Benchmarked on NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory) with the 31B DECKARD AWQ_FULL target and this E4B drafter, using 5 speculative tokens and a limit of 300 generated tokens per request.

| Concurrent | Aggregate tok/s | Per-Request tok/s | Avg Latency (300 tok) |
|---|---|---|---|
| 1 | 7.6 | 8.9 | 39.4s |
| 2 | 21.7 | 10.8 | 27.7s |
| 4 | 42.7 | 10.7 | 28.1s |

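Zero errors across all test runs. Aggregate throughput rises from 7.6 tok/s with one request to 21.7 tok/s with two and 42.7 tok/s with four, while per-request throughput holds at about 10.7 tok/s between two and four concurrent requests.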
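
The columns of the benchmark table can be cross-checked with a little arithmetic. The sketch below recomputes per-request throughput as 300 tokens divided by average latency, and the ratio of aggregate throughput between consecutive concurrency levels. The function and variable names are illustrative; the numbers are copied from the table.

```python
# Rows copied from the benchmark table: (concurrency, aggregate tok/s, avg latency in s).
ROWS = [(1, 7.6, 39.4), (2, 21.7, 27.7), (4, 42.7, 28.1)]
MAX_TOKENS = 300  # tokens generated per request in the benchmark


def per_request_tok_s(latency_s: float, tokens: int = MAX_TOKENS) -> float:
    """End-to-end throughput of one request: tokens / wall-clock latency."""
    return tokens / latency_s


def scaling_factors(rows):
    """Ratio of aggregate throughput between consecutive concurrency levels."""
    return [round(b[1] / a[1], 2) for a, b in zip(rows, rows[1:])]


for concurrency, aggregate, latency in ROWS:
    derived = per_request_tok_s(latency)
    print(f"{concurrency} concurrent: {derived:.1f} tok/s per request, "
          f"{derived * concurrency:.1f} tok/s aggregate (table: {aggregate})")

print(scaling_factors(ROWS))  # [2.86, 1.97]
```

The derived figures match the table for two and four concurrent requests. For a single request, 300 / 39.4 gives about 7.6 tok/s, which equals the aggregate column but not the 8.9 tok/s per-request value, so that figure may be measured differently (for example, excluding time to first token); the source does not say.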

Quick Start

1. Download both models

pip install -U huggingface-hub

# Target model (31B)
huggingface-cli download AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/deckard-31b

# This drafter model (E4B)
huggingface-cli download AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/e4b-drafter
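
The same downloads can be scripted from Python. This is a minimal sketch that assumes the `huggingface_hub` package installed above and uses its `snapshot_download` function; the `MODELS` mapping and the `download_all` helper are illustrative names, not part of the original instructions.

```python
import os

# Repo IDs and target directories copied from the CLI commands above.
MODELS = {
    "AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4": "~/models/deckard-31b",
    "AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4": "~/models/e4b-drafter",
}


def download_all(models=MODELS):
    """Download every repo in `models` and return the local snapshot paths."""
    # Imported lazily so the mapping can be inspected without the package installed.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id, local_dir in models.items():
        # expanduser turns "~" into the home directory, as the shell does for the CLI.
        target = os.path.expanduser(local_dir)
        paths.append(snapshot_download(repo_id=repo_id, local_dir=target))
    return paths
```

Calling `download_all()` starts the transfer. Both directories must match the volume paths used in the Docker Compose file in step 3.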

2. Get the patched vLLM files

Three patches are required for Gemma 4 speculative decoding. Download them from the target model's GitHub repo:

# -f makes curl fail on an HTTP error instead of saving the error page as a .py file
for f in eagle_patched.py serving_chat_patched.py modelopt_patched.py; do
  curl -fLO "https://raw.githubusercontent.com/AEON-7/Gemma-4-31B-DECKARD-HERETIC-Uncensored-NVFP4/main/$f"
done

3. Launch with Docker Compose

services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-deckard-31b-spec
    restart: unless-stopped
    network_mode: host
    volumes:
      - ~/models/deckard-31b:/models/deckard
      - ~/models/e4b-drafter:/models/e4b-drafter
      - ./modelopt_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/modelopt.py
      - ./serving_chat_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py
      - ./eagle_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/v1/spec_decode/eagle.py
    environment:
      - VLLM_TEST_FORCE_FP8_MARLIN=1
      - VLLM_MARLIN_USE_ATOMIC_ADD=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/deckard \
          --served-model-name deckard-31b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype fp8 \
          --tensor-parallel-size 1 \
          --max-model-len 131072 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --host 0.0.0.0 --port 8000 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4 \
          --speculative-config '{"method":"draft_model","model":"/models/e4b-drafter","num_speculative_tokens":5,"quantization":"modelopt"}'
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
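
The `--speculative-config` value is a JSON object embedded in a shell command inside YAML, which makes it easy to break with a stray quote. One way to avoid hand-editing it is to generate the argument, as in this sketch. The helper name `speculative_config_arg` is hypothetical; the keys and values are the ones used in the compose file above.

```python
import json
import shlex


def speculative_config_arg(draft_model_path: str, num_speculative_tokens: int = 5) -> str:
    """Return a shell-quoted value for vLLM's --speculative-config flag."""
    config = {
        "method": "draft_model",
        "model": draft_model_path,
        "num_speculative_tokens": num_speculative_tokens,
        "quantization": "modelopt",
    }
    # Compact separators reproduce the whitespace-free JSON used in the compose file.
    return shlex.quote(json.dumps(config, separators=(",", ":")))


arg = speculative_config_arg("/models/e4b-drafter")
print(f"--speculative-config {arg}")
```

Changing `num_speculative_tokens` here is also a convenient way to try other draft lengths; the benchmark above was run with 5.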

4. Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deckard-31b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement."}],
    "max_tokens": 200
  }'
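
The same request can be sent from Python using only the standard library. This is a sketch, assuming the server from step 3 is listening on `localhost:8000` and returns the usual OpenAI-compatible response shape; `build_chat_request` and `ask` are illustrative helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, model="deckard-31b",
                       base_url="http://localhost:8000", max_tokens=200):
    """Build the POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120.0, **kwargs):
    """Send the request and return the reply text (requires a running server)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Explain quantum entanglement.")
print(request.full_url)
```

Calling `ask("Explain quantum entanglement.")` performs the round trip. Because the server is started with a reasoning parser, part of the output may arrive in a separate reasoning field of the message rather than in `content`.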

Required vLLM Patches

Three patches to vLLM 0.19.1 are required for speculative decoding with Gemma 4. All are available in the target model GitHub repo.

| Patch | What it fixes |
|---|---|
| `eagle_patched.py` | Removes the multimodal spec decode guard, adds a Gemma4 model whitelist, and supports multi-group KV cache (heterogeneous head_dim=256/512) |
| `serving_chat_patched.py` | Fixes the non-streaming reasoning parser, which failed because `<\|channel>` tokens were stripped by `skip_special_tokens=True` |
| `modelopt_patched.py` | Adds NVFP4_AWQ quant_algo support, AWQ pre_quant_scale handling, and FP8 NaN scrubbing |
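
The `serving_chat_patched.py` entry is easier to follow with a toy example. The sketch below is not vLLM code: it uses a made-up detokenizer and parser, and the closing delimiter `<channel|>` is an assumption for illustration only. It shows why a parser that looks for delimiter tokens finds nothing once those tokens are dropped during decoding.

```python
import re

# Delimiters for illustration only; the closing marker is an assumption and the
# real Gemma 4 chat template may differ.
OPEN, CLOSE = "<|channel>", "<channel|>"
SPECIAL_TOKENS = {OPEN, CLOSE}


def decode(tokens, skip_special_tokens):
    """Toy detokenizer: join token strings, optionally dropping special tokens."""
    return "".join(
        t for t in tokens if not (skip_special_tokens and t in SPECIAL_TOKENS)
    )


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is None when no delimiters are found."""
    pattern = re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None, text
    return match.group(1), text[match.end():]


tokens = [OPEN, "Paris is the capital.", CLOSE, "Paris"]
print(split_reasoning(decode(tokens, skip_special_tokens=False)))
print(split_reasoning(decode(tokens, skip_special_tokens=True)))
```

With the delimiters kept, the parser separates reasoning from the answer. With them stripped, the reasoning text leaks into the answer and no reasoning is reported.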

Heterogeneous Attention

This E4B drafter mirrors the Gemma 4 heterogeneous attention design:

- 35 sliding-window layers: head_dim=256, window of 512 tokens, default RoPE (theta=10000)
- 7 full-attention layers: head_dim=512, global attention, proportional RoPE (theta=1M, partial_rotary_factor=0.25)

This creates two distinct KV cache groups, handled by the `eagle_patched.py` multi-group KV cache fix.
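
To see why the two groups need separate handling, it helps to estimate their cache footprints. The sketch below is an idealized calculation, not a measurement. It assumes both layer types use the 2 KV heads listed under Model Details, that each cached element takes 1 byte (as with an fp8 KV cache), and that sliding-window layers keep at most 512 tokens. It ignores block allocation overhead and scaling factors, and the `KVGroup` class is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KVGroup:
    name: str
    layers: int
    head_dim: int
    window: Optional[int]  # None means full (global) attention
    kv_heads: int = 2      # assumption: same KV head count in both groups

    def bytes_per_token(self, bytes_per_element: int = 1) -> int:
        # Keys and values are both cached, hence the factor of 2.
        return 2 * self.layers * self.kv_heads * self.head_dim * bytes_per_element

    def cached_tokens(self, context_len: int) -> int:
        # A sliding-window layer never needs more than `window` past tokens.
        return context_len if self.window is None else min(context_len, self.window)

    def cache_bytes(self, context_len: int, bytes_per_element: int = 1) -> int:
        return self.bytes_per_token(bytes_per_element) * self.cached_tokens(context_len)


GROUPS = [
    KVGroup("sliding-window", layers=35, head_dim=256, window=512),
    KVGroup("full-attention", layers=7, head_dim=512, window=None),
]

for group in GROUPS:
    total = group.cache_bytes(131072)
    print(f"{group.name}: {group.bytes_per_token()} B/token, "
          f"{total / 2**20:.1f} MiB at 131,072 tokens")
```

Under these assumptions the sliding-window group costs more per token (35,840 bytes against 14,336) but stays at about 17.5 MiB per sequence, while the seven full-attention layers grow to about 1.75 GiB at the maximum context length.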

Related Models

| Model | Type | Size | Link |
|---|---|---|---|
| Gemma 4 31B DECKARD AWQ_FULL (target) | Dense NVFP4 | 20.5 GB | HuggingFace \| GitHub |
| Gemma 4 31B DECKARD SVDQuant | Dense NVFP4 | 20.9 GB | HuggingFace |
| SuperGemma4 26B MoE | MoE NVFP4 | 15.3 GB | HuggingFace |
| vLLM AWQ Container | Docker | n/a | GHCR |