---
license: apache-2.0
base_model: google/gemma-4-e2b-it
base_model_relation: quantized
language:
- en
tags:
- gemma
- gemma4
- google
- moe
- mixture-of-experts
- multimodal
- vision
- quantized
- int8
- INT8
- w8a16
- compressed-tensors
- vllm
- text-generation
- conversational
- 8-bit
- ptq
- autoround
- llmcompressor
- safetensors
- sglang
- text-generation-inference
- 88plug
- post-training-quantization
- image
- vlm
pipeline_tag: image-text-to-text
library_name: transformers
model_type: gemma4
---

# Gemma4-E2B-W8A16

INT8 post-training quantization of [google/gemma-4-e2b-it](https://huggingface.co/google/gemma-4-e2b-it) — Google's 2B-active multimodal MoE with 128 experts. The smallest capable multimodal MoE. **Runs on any 8 GB GPU.**

---

## At a Glance

| Property | Value |
|---|---|
| Base model | `google/gemma-4-e2b-it` |
| Architecture | Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision |
| Quant format | compressed-tensors (native vLLM) |
| Quant method | AutoRound W8A16 (RTN, datafree) |
| Quantized | `language_model.*` transformer layers |
| Kept BF16 | vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE) |
| Min GPU | 1× RTX 3080 10GB / RTX 4070 |

---

## Quick Start

Tested with **vLLM v0.21.0** (`vllm/vllm-openai:v0.21.0-cu129-ubuntu2404`). Weights are in **compressed-tensors** format — vLLM detects and loads quantization automatically. No `--quantization` flag needed.

### vLLM

```bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Gemma4-E2B-it-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```

Weights are in **compressed-tensors** format — no `--quantization` flag needed. Requires **vLLM ≥ v0.21.0**.

### SGLang

```bash
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path google/gemma-4-e2b-it \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000
```

### llama.cpp

Fits entirely on an 8 GB GPU with Q4 quantization. VLM requires mmproj GGUF for image input.

```bash
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --mmproj --outfile Gemma4-E2B-mmproj.gguf

llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E2B-Q8_0.gguf \
  --mmproj Gemma4-E2B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081
```

---

## Benchmarks

*Results pending.*

| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W8A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 8 | 32k | — | — | — | — |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | — | — | — | — |

Hardware: A6000 48 GB, CUDA 12.9, driver 570.

---

## Quality Targets

| Metric | Target |
|---|---|
| KL divergence vs BF16 | < 0.005 |
| MMLU recovery | ≥ 99.7% |

### vs. Other Gemma4-E2B Quants

This is the first compressed-tensors W8A16 checkpoint for Gemma4-E2B. At ~2.5 GB it is the smallest vLLM-native multimodal checkpoint that fits on consumer 8 GB GPUs.

| Quant | Method | Size | GPU Compatibility | Notes |
|---|---|---|---|---|
| **88plug W8A16 (this)** | compressed-tensors RTN W8A16 | ~2.5 GB | Any Ampere+ ≥8 GB | First W8A16; native vLLM; vision+text |
| BF16 baseline | None | ~4.5 GB | 1× RTX 3080 10GB | Reference |
| Community GGUF Q4_K_M | llama.cpp GGUF | ~2.5 GB | CPU / any GPU | Vision requires mmproj GGUF |
| Community GGUF Q8_0 | llama.cpp GGUF | ~4.5 GB | Any GPU ≥6 GB | Near-lossless; vision requires mmproj |

---

## Limitations

- **Vision tower excluded**: SigLIP vision encoder stays BF16 — RTN INT8 not applied to vision components.
- **PLE layers excluded**: `embed_tokens_per_layer` and `per_layer_model_projection` (Per-Layer Embeddings) kept at BF16 to prevent catastrophic quality loss.
- **RTN (data-free) quantization**: No calibration corpus used. W8A16 RTN is near-lossless but has not been AutoRound-calibrated.
- **Benchmark results pending**: Throughput and quality benchmarks will be added post-publication.

---

## Citation

```bibtex
@misc{gemma4report,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-4-e2b-it}
}
```

---

## About

[**88plug AI Lab**](https://huggingface.co/88plug) produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

**W8A16** — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

**W4A16** — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from `quantization_config` in `config.json`. No `--quantization` flag required.

**Also available:** [Gemma4-E2B-it-W4A16 (INT4, ~6 GB)](https://huggingface.co/88plug/Gemma4-E2B-it-W4A16) · [Gemma4-E2B-it-W8A16 (INT8, ~7 GB)](https://huggingface.co/88plug/Gemma4-E2B-it-W8A16)

Browse all releases → [huggingface.co/88plug](https://huggingface.co/88plug)