---
license: gemma
tags:
- gemma
- gemma-4
- text-generation
- nvfp4
- modelopt
- tensorrt-llm
- int4
base_model:
- google/gemma-4-E4B-it
---

# Gemma 4 E4B: Text-Only NVFP4 (modelopt checkpoint)

NVFP4 quantization of Gemma 4 E4B's text decoder, produced via [`nvidia-modelopt`](https://github.com/NVIDIA/Model-Optimizer). The checkpoint itself is hardware-agnostic, but **inference requires NVIDIA Blackwell GPUs** (RTX 50xx, B100/B200, GB200) via TensorRT-LLM.

## What's in this repo

```
config.json              # Gemma4ForCausalLM, NVFP4 quantization metadata
generation_config.json
tokenizer.json + tokenizer_config.json + chat_template.jinja
model.safetensors        # NVFP4 weights (~5-6 GB)
```

This is the **modelopt checkpoint**, not a TRT-LLM engine. The engine has to be built on the GPU that will run it.

## Build the engine on your Blackwell GPU

```bash
# Download
git lfs install
git clone https://huggingface.co/tss-deposium/gemma-4-E4B-text-only-nvfp4
cd gemma-4-E4B-text-only-nvfp4

# Validate before the full build (~30s): a cheap compatibility signal
trtllm-build --checkpoint_dir . --output_dir /tmp/dryrun \
    --dry_run --log_level debug

# Full engine build (10-30 min on RTX 50xx)
trtllm-build --checkpoint_dir . \
    --output_dir ./engine \
    --gemm_plugin nvfp4 \
    --max_batch_size 4 --max_input_len 4096 --max_seq_len 5120 \
    --use_paged_context_fmha enable
```

## Inference

```python
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tss-deposium/gemma-4-E4B-text-only-nvfp4")
runner = ModelRunner.from_dir("./engine")

# Render the chat template to text, then tokenize it.
# If the template already emits <bos>, pass add_special_tokens=False
# to the tokenizer call below to avoid a duplicate.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Quelle est la capitale de la France ?"}],
    tokenize=False, add_generation_prompt=True,
)
ids = tok(prompt, return_tensors="pt").input_ids.cuda()

# end_id / pad_id tell the runtime when to stop and how to pad the batch.
out = runner.generate(
    ids, max_new_tokens=64,
    end_id=tok.eos_token_id, pad_id=tok.pad_token_id,
)

# out is [batch, beam, seq_len] and includes the prompt tokens; drop them.
print(tok.decode(out[0][0][ids.shape[1]:], skip_special_tokens=True))
```

## Caveats: read before adopting

- **Blackwell required**: NVFP4 is hardware-accelerated only on RTX 50xx, B100/B200, GB200. On older GPUs, FP4 ops fall back to FP16 simulation, losing the speedup.
- **Format stability**: the NVFP4 checkpoint format may change between modelopt minor releases. Pin your modelopt version to the one used in this checkpoint's source notebook (see Provenance below).
- **Gemma 4 in NVFP4 is experimental**: as of 2026-05, Gemma 4 is not in NVIDIA's official NVFP4 support matrix. Future modelopt or `trtllm-build` updates may break this path.
- **Calibration corpus**: ~140 prompts covering eight languages (FR/EN/ES/DE/IT/PT/RU/JA) plus extraction, JSON, code, and long-context tasks. If your inference distribution differs significantly, recalibrate from FP16 with your own corpus.

## When to use this vs `tss-deposium/gemma-4-E4B-text-only-onnx-int4`

| | This repo (NVFP4) | Sibling INT4 ONNX |
|---|---|---|
| Hardware | Blackwell only | Any GPU + CPU fallback |
| Stack | TensorRT-LLM | ONNX Runtime |
| Speed | 1.5-3× the INT4 ONNX baseline on Blackwell | baseline |
| Portability | Self-hosted RTX 50xx only | Linux/Docker/Railway/cross-OS |
| Quality | ~97-99% MMLU | ~95-97% MMLU |

If you're not on Blackwell, or you need cross-platform deployment, use the INT4 ONNX repo.

## Provenance

- **Author**: Nicolas Geysse, The Seed Ship (Deposium project, [theseedship/deposium-turbov3](https://github.com/theseedship/deposium-turbov3))
- **Source model**: `google/gemma-4-E4B-it` (multimodal; the text decoder is loaded directly via `AutoModelForCausalLM`)
- **Quantization**: `nvidia-modelopt` `NVFP4_DEFAULT_CFG`
- **Pipeline**: [`docs/gemma4_e4b_nvfp4_modelopt_export.ipynb`](https://github.com/theseedship/deposium-turbov3/blob/main/docs/gemma4_e4b_nvfp4_modelopt_export.ipynb)
- **License**: Gemma terms of use (inherited)
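
## Pre-flight GPU check (sketch)

The Blackwell requirement listed under Caveats can be checked before committing to the 10-30 minute engine build. The following is a minimal sketch, not part of the original pipeline. The function names are hypothetical, and the mapping from CUDA compute capability to Blackwell (10.x for B100/B200/GB200, 12.x for RTX 50xx) is an assumption that should be verified against NVIDIA's published GPU list. It uses PyTorch's `torch.cuda.get_device_capability` and reports nothing conclusive when PyTorch or CUDA is unavailable.

```python
from typing import Optional, Tuple

# Assumption (not stated in this README): the GPUs named above report
# compute capability 10.x (B100/B200/GB200) or 12.x (RTX 50xx).
ASSUMED_BLACKWELL_MAJORS = {10, 12}


def is_blackwell_capability(major: int, minor: int = 0) -> bool:
    """Return True if the compute capability matches the assumed Blackwell set."""
    return major in ASSUMED_BLACKWELL_MAJORS


def local_gpu_capability() -> Optional[Tuple[int, int]]:
    """Return (major, minor) for GPU 0, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return tuple(torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    cap = local_gpu_capability()
    if cap is None:
        print("No CUDA device visible (or PyTorch not installed); cannot check.")
    elif is_blackwell_capability(*cap):
        print(f"Compute capability {cap[0]}.{cap[1]}: looks like Blackwell.")
    else:
        print(f"Compute capability {cap[0]}.{cap[1]}: NVFP4 would not be "
              "hardware-accelerated; consider the INT4 ONNX repo.")
```

The capability test is kept as a pure function so it can be reused with values from other sources, such as `nvidia-smi` output, without importing PyTorch.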