---
license: other
tags:
- gguf
- llama.cpp
- gemma-4-E4B-it
- Q4_K_M
- cpu-inference
- text-generation
pipeline_tag: text-generation
---

# gemma-4-E4B-it — GGUF (Q4_K_M)

---

## 📊 Performance Metrics

- **Hardware:** Intel(R) Xeon(R) CPU @ 2.20GHz (4 vCPUs)
- **Size:** 4.97 GB  
- **Speed (Generation):** 4.18 tokens/sec  
- **Speed (Prompt):** 9.83 tokens/sec  
- **KV Cache Usage:** 0.0143 GB  
- **Quantization:** Q4_K_M  
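
As a rough guide to what these throughput figures mean in practice, the sketch below estimates wall-clock time for a request on comparable hardware. The function name `estimate_seconds` and the token counts (512 prompt tokens, 128 generated tokens) are illustrative assumptions, and the estimate ignores model load time and any other overhead.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate seconds for a request from the benchmark rates above.
# $1 = prompt tokens, $2 = generated tokens.
estimate_seconds() {
  awk -v p="$1" -v g="$2" -v prompt_tps=9.83 -v gen_tps=4.18 \
    'BEGIN { printf "%.1f\n", p / prompt_tps + g / gen_tps }'
}

# Illustrative request: 512 prompt tokens, 128 generated tokens.
estimate_seconds 512 128   # prints 82.7
```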

---

## 🔷 Model Overview

This repository contains a **GGUF quantized version** of:

- **Base Model:** gemma-4-E4B-it
- **Format:** GGUF (optimized for llama.cpp inference)
- **Precision:** Q4_K_M
- **Efficiency Score:** 0.8412 (tokens/sec per GB)
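
The efficiency score appears to be the generation speed divided by the file size. The sketch below recomputes it from the rounded figures published in this card; the helper name `efficiency_score` is hypothetical, and the formula is an assumption inferred from the numbers rather than a documented definition.

```bash
#!/usr/bin/env bash
# Hypothetical helper: tokens/sec divided by model size in GB.
# $1 = generation speed (tokens/sec), $2 = file size (GB).
efficiency_score() {
  awk -v tps="$1" -v gb="$2" 'BEGIN { printf "%.4f\n", tps / gb }'
}

# Using the rounded values from this card: 4.18 tokens/sec and 4.97 GB.
efficiency_score 4.18 4.97   # prints 0.8410
```

The result differs from the reported 0.8412 in the fourth decimal place, which would be consistent with the reported score having been computed from unrounded measurements.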

The GGUF format provides:
- Fast loading via memory mapping
- Single-file model distribution
- Cross-platform compatibility
- Efficient inference with llama.cpp

---

## 📦 Files

| File | Description |
|------|-------------|
| `gemma-4-E4B-it-Q4_K_M.gguf` | Quantized GGUF model file |
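
After downloading, a quick sanity check is to confirm that the file begins with the four-byte ASCII magic `GGUF`, which every GGUF file starts with. The sketch below is a minimal check only; the helper name `is_gguf` is hypothetical, and passing it does not prove the file is complete or uncorrupted.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file starts with the ASCII magic "GGUF".
is_gguf() {
  # Read the first 4 bytes; drop NUL bytes so the shell does not warn on binary input.
  [ "$(head -c 4 "$1" 2>/dev/null | tr -d '\0')" = "GGUF" ]
}

if is_gguf gemma-4-E4B-it-Q4_K_M.gguf; then
  echo "looks like a GGUF file"
else
  echo "missing, truncated, or not a GGUF file"
fi
```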

---

## ⚙️ Technical Details

| Parameter | Value |
|----------|------|
| Base Model | gemma-4-E4B-it |
| Format | GGUF |
| Precision | Q4_K_M |
| Runtime | llama.cpp |
| Benchmark Hardware | Intel(R) Xeon(R) CPU @ 2.20GHz (4 vCPUs) |
| Context Latency | 52.44s |
| Memory (KV) | 0.0143 GB |

---

## ⚡ Why GGUF?

GGUF is designed for efficient inference:

- Optimized for llama.cpp
- Supports CPU and GPU inference
- Single-file deployment
- Memory-mapped loading for speed
- Ideal for edge / local environments

---

## ⚠️ License & Usage

This is a **converted derivative model**.

- You must comply with the license of the original gemma-4-E4B-it model
- This is **not an official release**
- No additional rights are granted
- Original ownership remains with the base model creator

---

## 🚀 Quick Start (llama.cpp)

```bash
# Run from the directory that contains the llama-cli binary and the model file
./llama-cli -m gemma-4-E4B-it-Q4_K_M.gguf -p "Explain AI simply"
```