Gemma 4 E4B — Text-Only NVFP4 (modelopt checkpoint)

NVFP4 quantization of Gemma 4 E4B's text decoder, produced with nvidia-modelopt. The checkpoint itself is hardware-agnostic, but inference through TensorRT-LLM requires an NVIDIA Blackwell GPU (RTX 50xx, B100/B200, GB200).

What's in this repo

config.json              # Gemma4ForCausalLM, NVFP4 quantization metadata
generation_config.json
tokenizer.json + tokenizer_config.json + chat_template.jinja
model.safetensors        # NVFP4 weights (~5-6 GB)

This is the modelopt checkpoint, not a TRT-LLM engine.
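
To confirm what the download actually contains before building, the safetensors header can be read without loading any weights. The sketch below relies only on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name `safetensors_dtype_counts` is illustrative and not part of this repo.

```python
import json
import struct
from collections import Counter


def safetensors_dtype_counts(path):
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length,
        # followed by that many bytes of UTF-8 JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


# Example (run inside the cloned repo):
#   print(safetensors_dtype_counts("model.safetensors"))
```

Because only the header is parsed, this runs in well under a second even on a multi-gigabyte file.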

Build the engine on your Blackwell GPU

# Download
git lfs install
git clone https://huggingface.co/tss-deposium/gemma-4-E4B-text-only-nvfp4
cd gemma-4-E4B-text-only-nvfp4

# Validate before the full build (~30s) — cheap signal for compatibility
trtllm-build --checkpoint_dir . --output_dir /tmp/dryrun \
    --dry_run --log_level debug

# Full engine build (10-30 min on RTX 50xx)
trtllm-build --checkpoint_dir . \
    --output_dir ./engine \
    --gemm_plugin nvfp4 \
    --max_batch_size 4 --max_input_len 4096 --max_seq_len 5120 \
    --use_paged_context_fmha enable
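
The build limits above fix a token budget per request: `--max_seq_len` covers prompt plus generated tokens, so a maximum-length 4096-token prompt leaves 5120 - 4096 = 1024 tokens for the answer. A minimal sketch of that arithmetic, with a hypothetical helper name and the build values as defaults:

```python
def max_new_tokens(prompt_len, max_input_len=4096, max_seq_len=5120):
    """Return how many tokens can still be generated for a prompt of prompt_len tokens."""
    if prompt_len > max_input_len:
        raise ValueError(
            f"prompt has {prompt_len} tokens, engine accepts at most {max_input_len}"
        )
    # max_seq_len bounds the total sequence: prompt tokens plus generated tokens.
    return max_seq_len - prompt_len
```

If you rebuild with different limits, pass the new values instead of relying on the defaults.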

Inference

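from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tss-deposium/gemma-4-E4B-text-only-nvfp4")
runner = ModelRunner.from_dir("./engine")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Quelle est la capitale de la France ?"}],
    tokenize=False, add_generation_prompt=True,
)
# Chat templates are expected to emit special tokens such as BOS themselves,
# so stop the tokenizer from prepending a second one.
ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids[0].int().cuda()
# generate() takes a list of 1-D token tensors; pass end/pad ids explicitly.
out = runner.generate([ids], max_new_tokens=64,
                      end_id=tok.eos_token_id, pad_id=tok.pad_token_id)
# out is [batch, beams, tokens] and starts with the prompt; decode only the completion.
print(tok.decode(out[0][0][len(ids):], skip_special_tokens=True))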
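
The engine built above accepts at most 4 requests per call (`--max_batch_size 4`), so a longer list of prompts has to be split before it reaches `runner.generate`. A minimal sketch of that chunking, with a hypothetical helper name:

```python
def batches(items, max_batch_size=4):
    """Yield consecutive slices of items, each holding at most max_batch_size elements."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]
```

Each yielded slice would then be tokenized into a list of 1-D tensors and passed to the runner in a single call.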

Caveats — read before adopting

Blackwell required: NVFP4 is hardware-accelerated only on RTX 50xx, B100/B200, GB200. On older GPUs, FP4 ops fall back to FP16 simulation, losing the speedup.
Format stability: the NVFP4 checkpoint format may change between modelopt minor releases. Pin your modelopt version to the one used in this checkpoint's source notebook (see Provenance below).
Gemma 4 in NVFP4 is experimental: as of 2026-05, Gemma 4 is not in NVIDIA's official NVFP4 support matrix, so support in modelopt and trtllm-build may regress in future releases.
Calibration corpus: ~140 multilingual prompts (FR/EN/ES/DE/IT/PT/RU/JA/extraction/JSON/code/long-context). If your inference distribution differs significantly, recalibrate from FP16 with your own corpus.
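
The Blackwell requirement can be checked programmatically before downloading anything. The sketch below assumes that Blackwell parts report CUDA compute capability 10.x (B100/B200/GB200) or 12.x (RTX 50xx), while Hopper reports 9.0 and Ada 8.9; the function name is illustrative.

```python
def supports_native_nvfp4(capability):
    """Return True if a (major, minor) compute capability looks like Blackwell or newer.

    Assumption: Blackwell GPUs report major version 10 (data center) or 12 (RTX 50xx);
    earlier architectures report 9 or lower.
    """
    major, _minor = capability
    return major >= 10
```

With PyTorch installed, `supports_native_nvfp4(torch.cuda.get_device_capability())` performs the check on the current device, since that call returns a `(major, minor)` tuple.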

When to use this vs `tss-deposium/gemma-4-E4B-text-only-onnx-int4`

	This repo (NVFP4)	Sibling INT4 ONNX
Hardware	Blackwell only	Any GPU + CPU fallback
Stack	TensorRT-LLM	ONNX Runtime
Speed	1.5-3× the INT4 ONNX baseline on Blackwell	baseline
Portability	Self-hosted RTX 50xx only	Linux/Docker/Railway/cross-OS
Quality	~97-99% MMLU	~95-97% MMLU

If you're not on Blackwell, or you need cross-platform deployment, use the INT4 ONNX repo.

Provenance

Author: Nicolas Geysse — The Seed Ship (Deposium project, theseedship/deposium-turbov3)
Source model: google/gemma-4-E4B-it (multimodal — text decoder loaded directly via AutoModelForCausalLM)
Quantization: nvidia-modelopt NVFP4_DEFAULT_CFG
Pipeline: docs/gemma4_e4b_nvfp4_modelopt_export.ipynb
License: Gemma terms of use (inherited)
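
For readers who want to recalibrate on their own corpus, the general modelopt post-training quantization flow looks roughly like the sketch below. It is not the author's notebook: the helper names, the `max_length` value, and the one-prompt-per-step loop are assumptions, and only `mtq.quantize` with `NVFP4_DEFAULT_CFG` is taken from the provenance above.

```python
def make_forward_loop(tokenizer, prompts, max_length=512):
    """Build the calibration callback: run every prompt through the model once."""
    def forward_loop(model):
        for text in prompts:
            inputs = tokenizer(
                text, return_tensors="pt", truncation=True, max_length=max_length
            )
            # Move each input tensor to the model's device before the forward pass.
            model(**{name: tensor.to(model.device) for name, tensor in inputs.items()})
    return forward_loop


def quantize_nvfp4(model, tokenizer, prompts):
    """Calibrate and quantize a loaded FP16 model with the default NVFP4 config."""
    import modelopt.torch.quantization as mtq  # requires nvidia-modelopt

    return mtq.quantize(
        model, mtq.NVFP4_DEFAULT_CFG, make_forward_loop(tokenizer, prompts)
    )
```

The prompts passed in should resemble your production traffic, since the calibration pass uses the activations it observes to set the quantization scales.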

Safetensors: 6B params; tensor types F16, F8_E4M3
