TheStageAI/gemma-4-E2B-it-qat-GGUF

A portable GGUF release of Google's Gemma 4 E2B instruction model, compressed from Google's QAT-trained BF16 weights and emitted as standard llama.cpp-compatible .gguf files.

Use this repo when deployment portability matters most. If you can run our native MLX runtime and want the smallest artifacts, use the edge-lm sibling release.

Why this exists

The native edge-lm checkpoints use custom codecs for both decoder weights and PLE tables, which is why they are smaller at comparable quality. Many deployments, however, need standard GGUF files that work with llama.cpp-compatible tooling.

This repo keeps the production bit-width schedules from our native compression pipeline, but maps the weights into GGUF-compatible quantization formats. The result is larger than the native release, but portable.

How it was compressed

We start from Google's QAT-trained BF16 checkpoint and reuse the production m and l schedules from the native release.

  • Transformer blocks - the M and L files follow our RCO-selected production bit-width schedules; the weights are then emitted in GGUF-compatible K-quant layouts, using the group size and symmetric or asymmetric mode that each tensor family requires.
  • PLE tables - stored with GGUF-compatible Q4 scalar quantization instead of the native AQLM PLE codec, so the files stay portable across GGUF runtimes.
  • Token embeddings / LM head - quantized through the same GGUF-compatible path as the rest of the model.
  • W4-uniform - a conservative uniform 4-bit GGUF variant with the same Q4 PLE path.
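
To make "Q4 scalar quantization" concrete, the sketch below quantizes a list of floats block by block, storing one scale per block and one 4-bit code per value. It is a generic illustration, not the exact GGUF tensor type or the pipeline used for this release: the block size of 32, the symmetric scheme, and the names `quantize_q4` and `dequantize_q4` are all assumptions made for the example.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block size, for illustration only

Block = Tuple[float, List[int]]  # (scale, 4-bit codes in 0..15)


def quantize_q4(values: List[float]) -> List[Block]:
    """Quantize floats block by block to symmetric 4-bit codes."""
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        max_abs = max(abs(v) for v in chunk)
        if max_abs == 0.0:
            # All-zero block: code 8 decodes to exactly 0.
            blocks.append((0.0, [8] * len(chunk)))
            continue
        scale = max_abs / 7.0  # largest magnitude maps to level +/-7
        # Round to the nearest signed level, clamp to -8..7, shift to 0..15.
        codes = [max(-8, min(7, round(v / scale))) + 8 for v in chunk]
        blocks.append((scale, codes))
    return blocks


def dequantize_q4(blocks: List[Block]) -> List[float]:
    """Reconstruct approximate floats from (scale, codes) blocks."""
    return [(code - 8) * scale for scale, codes in blocks for code in codes]
```

In this sketch the reconstruction error of each value is at most half the block's scale. Assuming a 16-bit scale per 32-value block, such a layout costs (32 * 4 + 16) / 32 = 4.5 bits per value.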

Operating points

| File | Trade-off | Size | Compression vs BF16 | Transformer | PLE |
| --- | --- | --- | --- | --- | --- |
| gemma-4-E2B-it-qat-GGUF-M.gguf | Compact GGUF target | 2.47 GB | 4.1x | production m mapped to GGUF | GGUF Q4 |
| gemma-4-E2B-it-qat-GGUF-L.gguf | Higher-quality GGUF target | 2.68 GB | 3.8x | production l mapped to GGUF | GGUF Q4 |
| gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf | Uniform W4 baseline | 2.69 GB | 3.8x | uniform W4 GGUF | GGUF Q4 |
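
The compression column can be reproduced from the file sizes. The sketch below assumes the 10.21 GB BF16 reference size reported in the benchmark table and rounds to one decimal place; the dictionary and function names are illustrative only.

```python
BF16_SIZE_GB = 10.21  # BF16 reference size, taken from the benchmark table

FILE_SIZES_GB = {
    "gemma-4-E2B-it-qat-GGUF-M.gguf": 2.47,
    "gemma-4-E2B-it-qat-GGUF-L.gguf": 2.68,
    "gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf": 2.69,
}


def compression_ratio(size_gb: float, reference_gb: float = BF16_SIZE_GB) -> float:
    """Return reference size divided by file size, rounded to one decimal."""
    return round(reference_gb / size_gb, 1)


for name, size_gb in FILE_SIZES_GB.items():
    print(f"{name}: {compression_ratio(size_gb)}x")
```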

Usage

Use a recent upstream llama.cpp build. Example:

llama-completion \
  -m gemma-4-E2B-it-qat-GGUF-L.gguf \
  -p "Explain gravity in one sentence." \
  -n 64
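
Before loading a downloaded file, it can help to confirm that it really is a GGUF container. The sketch below reads the first eight bytes and checks the `GGUF` magic and the version field that follows it, assuming the default little-endian layout. The helper name `read_gguf_version` is made up for this example, and the check does not validate tensors or metadata.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The version is an unsigned 32-bit integer, assumed little-endian here.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, `read_gguf_version("gemma-4-E2B-it-qat-GGUF-L.gguf")` returns the format version for a valid file and raises `ValueError` for a truncated or mislabeled one.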

Benchmarks

For quality evaluation, each GGUF checkpoint is dequantized to BF16 and run through the same evaluation path used for the native release, so every row is scored on the same backend. IFEval p/i means prompt-level strict / instruction-level strict accuracy, using the corrected public recipe with max_gen_toks=1280.

| Model | Size | Compression | MMLU-Pro | IFEval p/i |
| --- | --- | --- | --- | --- |
| BF16 reference | 10.21 GB | 1.0x | 61.85 | 75.23 / 82.37 |
| GGUF M | 2.47 GB | 4.1x | 53.79 | 72.64 / 81.29 |
| GGUF L | 2.68 GB | 3.8x | 57.12 | 73.38 / 81.65 |
| GGUF W4-uniform | 2.69 GB | 3.8x | 56.91 | 74.68 / 82.61 |

MMLU-Pro scores come from the official checkpoint-wise vLLM route, with Gemma chat formatting and thinking enabled. Separately, the .gguf files in this repo passed generation smoke tests with upstream llama.cpp.
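
To compare the operating points directly, the reported scores can be expressed as differences from the BF16 reference. The sketch below copies the numbers from the benchmark table; the data structure, metric keys, and helper name are illustrative only.

```python
# Scores copied from the benchmark table (IFEval split into prompt / instruction strict).
SCORES = {
    "BF16 reference": {"mmlu_pro": 61.85, "ifeval_prompt": 75.23, "ifeval_instruction": 82.37},
    "GGUF M": {"mmlu_pro": 53.79, "ifeval_prompt": 72.64, "ifeval_instruction": 81.29},
    "GGUF L": {"mmlu_pro": 57.12, "ifeval_prompt": 73.38, "ifeval_instruction": 81.65},
    "GGUF W4-uniform": {"mmlu_pro": 56.91, "ifeval_prompt": 74.68, "ifeval_instruction": 82.61},
}


def delta_vs_reference(model: str, metric: str, reference: str = "BF16 reference") -> float:
    """Return the score difference (model minus reference), rounded to two decimals."""
    return round(SCORES[model][metric] - SCORES[reference][metric], 2)


for model in SCORES:
    if model != "BF16 reference":
        print(model, delta_vs_reference(model, "mmlu_pro"))
```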

Files

| File | Contents |
| --- | --- |
| gemma-4-E2B-it-qat-GGUF-M.gguf | Compact GGUF target |
| gemma-4-E2B-it-qat-GGUF-L.gguf | Higher-quality GGUF target |
| gemma-4-E2B-it-qat-GGUF-W4-uniform.gguf | Uniform W4 GGUF baseline |

License

Released under the MIT License. As a derivative of Gemma, the weights are also subject to the Gemma Terms of Use.

Citation

If you use these checkpoints, please cite the Gemma 4 release and the methods we build on (GPTQ, QEP, AQLM, RCO) - see the references in the edge-lm write-up.
