---
license: apache-2.0
base_model: google/gemma-4-31B-it
tags:
- awq
- turboquant
- kv-cache-quantization
- gemma
- gemma4
- quantized
- 8bit
library_name: transformers
pipeline_tag: image-text-to-text
---

# Gemma 4 31B-it - TurboQuant AWQ 8-bit

**8-bit AWQ-quantized version** of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) (31B dense, instruction-tuned) with TurboQuant KV-cache quantization.

AWQ (Activation-aware Weight Quantization) chooses its quantization scales from activation statistics so that the most important weights lose the least precision, and it is designed for GPU inference. The 8-bit variant keeps quality very close to the FP16 baseline while roughly halving weight VRAM, which makes it suitable for high-fidelity chat deployments.

Approximate model size: **~31 GB**

## Model Specifications

| Property | Value |
|---|---|
| **Base Model** | [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
| **Parameters** | ~31 billion |
| **Architecture** | Dense transformer, instruction-tuned |
| **Modality** | Multimodal: image + text input, text output |
| **License** | Apache 2.0 |
| **Weight Quantization** | AWQ 8-bit (~31 GB) |
| **Group Size** | 128 |
| **KV-Cache Quantization** | TurboQuant |
| **Framework** | transformers + AutoAWQ / vLLM |

## Quickstart

### AutoAWQ

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load the quantized weights and place them on the available GPU(s).
model = AutoAWQForCausalLM.from_quantized(
    "majentik/gemma-4-31B-it-TurboQuant-AWQ-8bit",
    device_map="auto",
    fuse_layers=True,
)
tokenizer = AutoTokenizer.from_pretrained("majentik/gemma-4-31B-it-TurboQuant-AWQ-8bit")

messages = [{"role": "user", "content": "Provide a summary of transformer architecture."}]
# return_dict=True returns input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)
# The AutoAWQ wrapper may not expose `.device`, so read it from the parameters.
inputs = inputs.to(next(model.parameters()).device)

out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

### vLLM

```bash
vllm serve majentik/gemma-4-31B-it-TurboQuant-AWQ-8bit \
  --quantization awq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 8192
```

## What is TurboQuant?

TurboQuant ([arXiv: 2504.19874](https://arxiv.org/abs/2504.19874)) is a KV-cache quantization technique: it compresses the key-value cache that grows with context length during autoregressive generation. Combined with 8-bit AWQ weights, it delivers near-FP16 quality at roughly half the VRAM cost.

## KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| **TurboQuant** | 1x (baseline) | 1x (baseline) | High | [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) |
| **RotorQuant** | **5.3x faster** | **28% faster** | High | [GitHub](https://github.com/scrya-com/rotorquant) |

## AWQ vs GGUF vs MLX

| Format | Target Hardware | Runtime | Best For |
|---|---|---|---|
| **AWQ** | NVIDIA / AMD GPU (CUDA/ROCm) | AutoAWQ, vLLM, TGI | GPU-native inference, production serving |
| **GGUF** | CPU + GPU (cross-platform) | llama.cpp, Ollama, LM Studio | Laptops, CPU-only boxes, mixed offload |
| **MLX** | Apple Silicon | MLX, mlx-lm, mlx-vlm | Macs with unified memory |

This repo ships **AWQ**. See the "See Also" section for GGUF and MLX siblings.

## Memory Estimates (Gemma 4 31B-it)

| Precision | Approximate Size | VRAM Tier |
|---|---|---|
| FP16 (original) | ~62 GB | 80 GB+ (A100/H100) |
| **AWQ 8-bit** | **~31 GB** | **40 GB+ (A100 40/80GB, L40S, 2x RTX 4090)** |
| AWQ 4-bit | ~17 GB | 24 GB+ |

Best deployed on server-class GPUs (A100 40/80GB, L40S, H100) or on dual RTX 4090 cards with tensor parallelism.
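
The sizes in the Memory Estimates table can be roughly reproduced from the parameter count alone. The sketch below is a back-of-the-envelope estimate, not a measurement: the helper `weight_memory_gb` is hypothetical, it assumes every parameter is quantized, and it assumes each group of 128 weights stores one FP16 scale plus one zero point of the same bit width as the weights. Real checkpoints usually keep some tensors (for example embeddings) in higher precision, so on-disk sizes come out somewhat larger, and the KV cache and activations need additional VRAM on top.

```python
from typing import Optional


def weight_memory_gb(n_params: float, bits: int, group_size: Optional[int] = None) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: with group-wise quantization, each group adds one FP16 scale
    (16 bits) and one zero point (`bits` bits) of metadata.
    """
    bits_per_weight = float(bits)
    if group_size is not None:
        bits_per_weight += (16 + bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 31e9  # ~31 billion parameters, from the Model Specifications table

for label, bits, group in [("FP16", 16, None), ("AWQ 8-bit", 8, 128), ("AWQ 4-bit", 4, 128)]:
    print(f"{label:>9}: ~{weight_memory_gb(N_PARAMS, bits, group):.1f} GB")
```

Under these assumptions the group metadata adds only about 2% to the 8-bit weights, which is why the 8-bit size stays close to one byte per parameter.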

## Hardware Requirements

- NVIDIA GPU with >=40 GB VRAM on a single card, or 2x 24 GB cards with TP=2
- Recommended: A100 40GB, A100 80GB, L40S 48GB, H100 80GB
- CUDA 12.x recommended
- For vLLM: compute capability >= 8.0 (Ampere or newer) for Marlin kernels

## See Also

- [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) -- Base model
- [majentik/gemma-4-31B-it-TurboQuant](https://huggingface.co/majentik/gemma-4-31B-it-TurboQuant) -- TurboQuant KV-cache only (transformers)
- [majentik/gemma-4-31B-it-TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma-4-31B-it-TurboQuant-AWQ-4bit) -- AWQ 4-bit variant
- [majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma-4-31B-it-RotorQuant-AWQ-8bit) -- RotorQuant AWQ 8-bit variant
- [majentik/gemma-4-31B-it-TurboQuant-MLX-8bit](https://huggingface.co/majentik/gemma-4-31B-it-TurboQuant-MLX-8bit) -- MLX variant (Apple Silicon)
- [TurboQuant Paper (arXiv: 2504.19874)](https://arxiv.org/abs/2504.19874)
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
- [vLLM](https://github.com/vllm-project/vllm)

## Quant trade-off (AWQ lane)

| Bits | Approx size | Method | Recommended for |
|---|---|---|---|
| 4-bit | ~17 GB | Activation-aware 4-bit weight quant | GPU inference (vLLM, transformers, AutoAWQ) |
| **8-bit** | ~31 GB | Activation-aware 8-bit weight quant | **Quality-sensitive GPU inference** |

The current variant, **8-bit**, is bolded.

## Variants in this family

The table below lists the 18 sibling variants under `majentik/gemma4-31b-it-*`. The current variant, `TurboQuant-AWQ-8bit`, is **bolded**.

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/gemma4-31b-it-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-AWQ-4bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-awq-4bit) | transformers | ~19 GB | GPU 4-bit (AutoAWQ) |
| [RotorQuant-AWQ-8bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-awq-8bit) | transformers | ~34 GB | GPU 8-bit (AutoAWQ) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-IQ4_XS) | llama.cpp | ~27 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q2_K) | llama.cpp | ~19 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q3_K_M) | llama.cpp | ~24 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q4_K_M) | llama.cpp | ~34 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q5_K_M) | llama.cpp | ~41 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-gguf-Q8_0) | llama.cpp | ~65 GB | Near-lossless reference |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-mlx-2bit) | mlx-lm | ~9.9 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-31b-it-rotorquant-mlx-8bit) | mlx-lm | ~37 GB | Apple Silicon reference |
| [TurboQuant](https://huggingface.co/majentik/gemma4-31b-it-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-AWQ-4bit](https://huggingface.co/majentik/gemma4-31b-it-turboquant-awq-4bit) | transformers | ~19 GB | GPU 4-bit (AutoAWQ) |
| **TurboQuant-AWQ-8bit** | transformers | ~34 GB | GPU 8-bit (AutoAWQ) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/gemma4-31b-it-turboquant-mlx-2bit) | mlx-lm | ~9.9 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/gemma4-31b-it-turboquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/gemma4-31b-it-turboquant-mlx-8bit) | mlx-lm | ~37 GB | Apple Silicon reference |
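
As a rough guide to choosing among the TurboQuant rows above, the sketch below picks the largest variant that fits a given memory budget. It is illustrative only: the helper `pick_variant` and the 20% headroom reserved for the KV cache and activations are assumptions, and the sizes are the approximate values copied from the table.

```python
from typing import Optional

# (variant, runtime, approx size in GB), copied from the TurboQuant rows of the table.
TURBOQUANT_VARIANTS = [
    ("TurboQuant-AWQ-4bit", "transformers", 19.0),
    ("TurboQuant-AWQ-8bit", "transformers", 34.0),
    ("TurboQuant-MLX-2bit", "mlx-lm", 9.9),
    ("TurboQuant-MLX-4bit", "mlx-lm", 19.0),
    ("TurboQuant-MLX-8bit", "mlx-lm", 37.0),
]


def pick_variant(memory_gb: float, runtime: str, headroom: float = 0.2) -> Optional[str]:
    """Return the largest variant for `runtime` that fits in `memory_gb`.

    `headroom` is an assumed fraction of the weight size reserved for the
    KV cache and activations; tune it for your context length and batch size.
    """
    fitting = [
        (size, name)
        for name, rt, size in TURBOQUANT_VARIANTS
        if rt == runtime and size * (1 + headroom) <= memory_gb
    ]
    # max() compares by size first, so the largest fitting variant wins.
    return max(fitting)[1] if fitting else None


print(pick_variant(48, "transformers"))  # e.g. a single L40S 48GB
print(pick_variant(24, "transformers"))  # e.g. a single 24 GB card
print(pick_variant(16, "mlx-lm"))        # e.g. a 16 GB Mac
```

When nothing fits, the function returns `None`, which signals that a smaller variant or more memory (for example, tensor parallelism across two cards) is needed.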