---
license: apache-2.0
base_model: google/gemma-4-31B-it
tags:
  - awq
  - turboquant
  - kv-cache-quantization
  - gemma
  - gemma4
  - quantized
  - 8bit
library_name: transformers
pipeline_tag: image-text-to-text
---

# Gemma 4 31B-it - TurboQuant AWQ 8-bit

**8-bit AWQ-quantized version** of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) (31B dense, instruction-tuned) with TurboQuant KV-cache quantization. AWQ (Activation-aware Weight Quantization) uses activation statistics to protect the most salient weights during quantization and targets GPU inference. The 8-bit variant keeps quality close to the FP16 baseline while roughly halving weight VRAM, which suits high-fidelity chat deployments.

Approximate model size: **~31 GB**

## Model Specifications

| Property | Value |
|---|---|
| **Base Model** | [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
| **Parameters** | ~31 billion |
| **Architecture** | Dense transformer, instruction-tuned |
| **Modality** | Multimodal: image + text input, text output |
| **License** | Apache 2.0 |
| **Weight Quantization** | AWQ 8-bit (~31 GB) |
| **Group Size** | 128 |
| **KV-Cache Quantization** | TurboQuant |
| **Framework** | transformers + AutoAWQ / vLLM |

## Quickstart

### AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

repo_id = "majentik/gemma-4-31B-it-TurboQuant-AWQ-8bit"

# Load the quantized weights and the matching tokenizer.
model = AutoAWQForCausalLM.from_quantized(repo_id, device_map="auto", fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

messages = [{"role": "user", "content": "Provide a summary of transformer architecture."}]
# return_dict=True yields input_ids plus attention_mask; AWQ kernels run on CUDA.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
```

### vLLM

```bash
vllm serve majentik/gemma-4-31B-it-TurboQuant-AWQ-8bit \
  --quantization awq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 8192
```
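
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The snippet below is a minimal sketch using only the Python standard library. It assumes the default `localhost:8000` address (adjust `BASE_URL` if you passed `--host` or `--port`), and the helper names `build_chat_request` and `ask` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "majentik/gemma-4-31B-it-TurboQuant-AWQ-8bit"
# Assumption: vLLM's default bind address; change it to match your server flags.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires a running server):
#   print(ask("Provide a summary of transformer architecture."))
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.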

## What is TurboQuant?

TurboQuant ([arXiv: 2504.19874](https://arxiv.org/abs/2504.19874)) is a KV-cache quantization technique that compresses the key-value cache used during autoregressive generation. Combined with 8-bit AWQ weights, it delivers near-FP16 quality at roughly half the VRAM cost.
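
To get a feel for why KV-cache compression matters at long context, the cache footprint can be estimated with simple arithmetic. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are hypothetical placeholders (read the real values from the model's `config.json`), the 4-bit width is an assumed example rather than a documented TurboQuant setting, and the estimate ignores quantization metadata and any architecture-specific details such as sliding-window layers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value, batch_size=1):
    """Bytes needed to store keys and values for every layer.

    The leading factor of 2 accounts for storing both K and V.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


# Hypothetical shape, NOT the published Gemma 4 31B config.
shape = dict(num_layers=60, num_kv_heads=16, head_dim=128, seq_len=8192)

fp16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
q4_bytes = kv_cache_bytes(**shape, bits_per_value=4)  # assumed 4-bit cache

print(f"FP16 KV cache:  {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit KV cache: {q4_bytes / 2**30:.2f} GiB")
```

The cache grows linearly with sequence length and batch size, so the savings become more significant as context and concurrency increase.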

## KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | **5.3x faster** | **28% faster** | High | [GitHub](https://github.com/scrya-com/rotorquant) |

## AWQ vs GGUF vs MLX

| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| **AWQ** | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| **GGUF** | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| **MLX** | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |

This repo ships **AWQ**. MLX siblings are listed under "See Also"; the "Variants in this family" table also covers the GGUF builds.

## Memory Estimates (Gemma 4 31B-it)

| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~62 GB | 80 GB+ (A100/H100) |
| **AWQ 8-bit** | **~31 GB** | **40 GB+ (A100 40/80GB, L40S, 2x RTX 4090)** |
| AWQ 4-bit | ~17 GB | 24 GB+ |

Best deployed on server-class GPUs (A100 40/80GB, L40S, H100) or dual RTX 4090 with tensor parallelism.
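
The sizes in the table follow roughly from parameter count times bits per weight. The sketch below reproduces that arithmetic under stated assumptions: one FP16 scale and one zero point (stored at the weight bit width) per quantization group, and no allowance for layers that are typically left unquantized, so published checkpoint sizes can come out somewhat higher. The function name `weight_gb` is illustrative.

```python
from typing import Optional


def weight_gb(num_params: float, bits: int, group_size: Optional[int] = None,
              scale_bits: int = 16) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    With group_size set, each group of weights is assumed to carry one
    scale (scale_bits) and one zero point (bits) as overhead.
    """
    bits_per_weight = float(bits)
    if group_size:
        bits_per_weight += (scale_bits + bits) / group_size
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 31e9  # ~31 billion parameters

print(f"FP16:         {weight_gb(PARAMS, 16):.1f} GB")
print(f"8-bit, g=128: {weight_gb(PARAMS, 8, group_size=128):.1f} GB")
print(f"4-bit, g=128: {weight_gb(PARAMS, 4, group_size=128):.1f} GB")
```

Leave headroom on top of these figures for activations and the KV cache when choosing a VRAM tier.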

## Hardware Requirements

- NVIDIA GPU with >=40 GB VRAM single-card, or 2x 24 GB cards with TP=2
- Recommended: A100 40GB, A100 80GB, L40S 48GB, H100 80GB
- CUDA 12.x recommended
- For vLLM: compute capability >= 8.0 (Ampere or newer) for the `awq_marlin` kernels
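
The VRAM rule above can be turned into a small helper that suggests a `--tensor-parallel-size` value. This is a sketch with a hypothetical function name, not an official tool; the thresholds are taken directly from the list above and do not account for long-context KV-cache headroom.

```python
from typing import List, Optional

SINGLE_CARD_MIN_GB = 40  # one card with >= 40 GB VRAM
DUAL_CARD_MIN_GB = 24    # or two cards with >= 24 GB each


def suggest_tensor_parallel(vram_gb: List[float]) -> Optional[int]:
    """Return a tensor-parallel size for the given GPUs, or None if they are too small."""
    if any(v >= SINGLE_CARD_MIN_GB for v in vram_gb):
        return 1
    if sum(1 for v in vram_gb if v >= DUAL_CARD_MIN_GB) >= 2:
        return 2
    return None


print(suggest_tensor_parallel([80.0]))        # single large card
print(suggest_tensor_parallel([24.0, 24.0]))  # two 24 GB cards
print(suggest_tensor_parallel([16.0]))        # not enough VRAM
```

On a machine with PyTorch installed, the input list can be filled from `torch.cuda.get_device_properties(i).total_memory / 1e9` for each visible device.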

## See Also

- [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) -- Base model
- [majentik/gemma-4-31B-it-TurboQuant](https://huggingface.co/majentik/gemma-4-31B-it-TurboQuant) -- TurboQuant KV-cache only (transformers)
- [majentik/gemma-4-31B-it-TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma-4-31B-it-TurboQuant-AWQ-4bit) -- AWQ 4-bit variant
- [majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit) -- RotorQuant AWQ 8-bit variant
- [majentik/gemma-4-31B-it-TurboQuant-MLX-8bit](https://huggingface.co/majentik/gemma-4-31B-it-TurboQuant-MLX-8bit) -- MLX variant (Apple Silicon)
- [TurboQuant Paper (arXiv: 2504.19874)](https://arxiv.org/abs/2504.19874)
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [vLLM](https://github.com/vllm-project/vllm)

## Quant trade-off (AWQ lane)

| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| 4-bit | ~17 GB | Activation-aware 4-bit weight quant | GPU inference (vLLM, transformers, AutoAWQ) |
| **8-bit** | ~31 GB | Activation-aware 8-bit weight quant | **Quality-sensitive GPU inference** |

(The current variant, **8-bit**, is bolded.)

## Variants in this family

(Showing 18 sibling variants under `majentik/gemma4-31b-it-*`. The current variant, `TurboQuant-AWQ-8bit`, is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/gemma4-31b-it-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-AWQ-4bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-awq-4bit) | transformers | ~19 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-awq-8bit) | transformers | ~34 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-IQ4_XS) | llama.cpp | ~27 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q2_K) | llama.cpp | ~19 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q3_K_M) | llama.cpp | ~24 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q4_K_M) | llama.cpp | ~34 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q5_K_M) | llama.cpp | ~41 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q8_0) | llama.cpp | ~65 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-mlx-2bit) | mlx-lm | ~9.9 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-mlx-8bit) | mlx-lm | ~37 GB | Apple Silicon reference |
| [TurboQuant](https://huggingface.co/majentik/gemma4-31b-it-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma4-31b-it-turboquant-awq-4bit) | transformers | ~19 GB | GPU 4-bit (AutoAWQ) |
| **TurboQuant-AWQ-8bit** | transformers | ~34 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-31b-it-turboquant-mlx-2bit) | mlx-lm | ~9.9 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-31b-it-turboquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-31b-it-turboquant-mlx-8bit) | mlx-lm | ~37 GB | Apple Silicon reference |