--- license: apache-2.0 base_model: google/gemma-4-e2b-it base_model_relation: quantized language: - en tags: - gemma - gemma4 - google - moe - mixture-of-experts - multimodal - vision - quantized - int8 - INT8 - w8a16 - compressed-tensors - vllm - text-generation - conversational - 8-bit - ptq - autoround - llmcompressor - safetensors - sglang - text-generation-inference - 88plug - post-training-quantization - image - vlm pipeline_tag: image-text-to-text library_name: transformers model_type: gemma4 --- # Gemma4-E2B-W8A16 INT8 post-training quantization of [google/gemma-4-e2b-it](https://huggingface.co/google/gemma-4-e2b-it) — Google's 2B-active multimodal MoE with 128 experts. The smallest capable multimodal MoE. **Runs on any 8 GB GPU.** --- ## At a Glance | Property | Value | |---|---| | Base model | `google/gemma-4-e2b-it` | | Architecture | Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision | | Quant format | compressed-tensors (native vLLM) | | Quant method | AutoRound W8A16 (RTN, datafree) | | Quantized | `language_model.*` transformer layers | | Kept BF16 | vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE) | | Min GPU | 1× RTX 3080 10GB / RTX 4070 | --- ## Quick Start Tested with **vLLM v0.21.0** (`vllm/vllm-openai:v0.21.0-cu129-ubuntu2404`). Weights are in **compressed-tensors** format — vLLM detects and loads quantization automatically. No `--quantization` flag needed. ### vLLM ```bash docker run --gpus device=0 -p 8080:8080 \ vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \ 88plug/Gemma4-E2B-it-W8A16 \ --kv-cache-dtype fp8 \ --max-model-len 32768 \ --gpu-memory-utilization 0.90 ``` Weights are in **compressed-tensors** format — no `--quantization` flag needed. Requires **vLLM ≥ v0.21.0**. ### SGLang ```bash docker run --gpus device=0 -p 30000:30000 \ lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \ --model-path google/gemma-4-e2b-it \ --tp 1 \ --mem-fraction-static 0.85 \ --port 30000 ``` ### llama.cpp Fits entirely on an 8 GB GPU with Q4 quantization. VLM requires mmproj GGUF for image input. ```bash python convert_hf_to_gguf.py google/gemma-4-e2b-it \ --outfile Gemma4-E2B-BF16.gguf python convert_hf_to_gguf.py google/gemma-4-e2b-it \ --mmproj --outfile Gemma4-E2B-mmproj.gguf llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0 llama-quantize --imatrix calibration_datav3.txt \ Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS llama-server \ --model Gemma4-E2B-Q8_0.gguf \ --mmproj Gemma4-E2B-mmproj.gguf \ --n-gpu-layers 999 \ --ctx-size 32768 \ --port 8081 ``` --- ## Benchmarks *Results pending.* | Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM | |---|---|---|---|---|---|---|---| | vLLM v0.21.0 | W8A16 | 1 | 32k | — | — | — | — | | vLLM v0.21.0 | W8A16 | 8 | 32k | — | — | — | — | | SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | — | — | — | — | | llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | — | — | — | — | | llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | — | — | — | — | Hardware: A6000 48 GB, CUDA 12.9, driver 570. --- ## Quality Targets | Metric | Target | |---|---| | KL divergence vs BF16 | < 0.005 | | MMLU recovery | ≥ 99.7% | ### vs. Other Gemma4-E2B Quants This is the first compressed-tensors W8A16 checkpoint for Gemma4-E2B. At ~2.5 GB it is the smallest vLLM-native multimodal checkpoint that fits on consumer 8 GB GPUs. | Quant | Method | Size | GPU Compatibility | Notes | |---|---|---|---|---| | **88plug W8A16 (this)** | compressed-tensors RTN W8A16 | ~2.5 GB | Any Ampere+ ≥8 GB | First W8A16; native vLLM; vision+text | | BF16 baseline | None | ~4.5 GB | 1× RTX 3080 10GB | Reference | | Community GGUF Q4_K_M | llama.cpp GGUF | ~2.5 GB | CPU / any GPU | Vision requires mmproj GGUF | | Community GGUF Q8_0 | llama.cpp GGUF | ~4.5 GB | Any GPU ≥6 GB | Near-lossless; vision requires mmproj | --- ## Limitations - **Vision tower excluded**: SigLIP vision encoder stays BF16 — RTN INT8 not applied to vision components. - **PLE layers excluded**: `embed_tokens_per_layer` and `per_layer_model_projection` (Per-Layer Embeddings) kept at BF16 to prevent catastrophic quality loss. - **RTN (data-free) quantization**: No calibration corpus used. W8A16 RTN is near-lossless but has not been AutoRound-calibrated. - **Benchmark results pending**: Throughput and quality benchmarks will be added post-publication. --- ## Citation ```bibtex @misc{gemma4report, title = {Gemma 4 Technical Report}, author = {Google DeepMind}, year = {2025}, url = {https://huggingface.co/google/gemma-4-e2b-it} } ``` --- ## About [**88plug AI Lab**](https://huggingface.co/88plug) produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags. **W8A16** — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot. **W4A16** — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production. All weights are in compressed-tensors format. vLLM detects quantization automatically from `quantization_config` in `config.json`. No `--quantization` flag required. **Also available:** [Gemma4-E2B-it-W4A16 (INT4, ~6 GB)](https://huggingface.co/88plug/Gemma4-E2B-it-W4A16) · [Gemma4-E2B-it-W8A16 (INT8, ~7 GB)](https://huggingface.co/88plug/Gemma4-E2B-it-W8A16) Browse all releases → [huggingface.co/88plug](https://huggingface.co/88plug)