---
license: other
tags:
- gguf
- llama.cpp
- gemma-4-E4B-it
- Q4_K_M
- cpu-inference
- text-generation
pipeline_tag: text-generation
---

# gemma-4-E4B-it — GGUF (Q4_K_M)

---

## 📊 Performance Metrics

- **Hardware:** Intel(R) Xeon(R) CPU @ 2.20GHz (4 vCPUs)
- **Size:** 4.97 GB
- **Speed (Generation):** 4.18 tokens/sec
- **Speed (Prompt):** 9.83 tokens/sec
- **KV Cache Usage:** 0.0143 GB
- **Quantization:** Q4_K_M

---

## 🔷 Model Overview

This repository contains a **GGUF quantized version** of:

- **Base Model:** gemma-4-E4B-it
- **Format:** GGUF (optimized for llama.cpp inference)
- **Precision:** Q4_K_M
- **Efficiency Score:** 0.8412 (TPS/GB)

GGUF format provides:

- Fast loading via memory mapping
- Single-file model distribution
- Cross-platform compatibility
- Efficient inference with llama.cpp

---

## 📦 Files

| File | Description |
|------|-------------|
| `gemma-4-E4B-it-Q4_K_M.gguf` | Quantized GGUF model file |

---

## ⚙️ Technical Details

| Parameter | Value |
|----------|------|
| Architecture | gemma-4-E4B-it |
| Format | GGUF |
| Precision | Q4_K_M |
| Runtime | llama.cpp |
| Benchmark Hardware | Intel(R) Xeon(R) CPU @ 2.20GHz (4 vCPUs) |
| Context Latency | 52.44s |
| Memory (KV) | 0.0143 GB |

---

## ⚡ Why GGUF?

GGUF is designed for efficient inference:

- Optimized for llama.cpp
- Supports CPU and GPU inference
- Single-file deployment
- Memory-mapped loading for speed
- Ideal for edge / local environments

---

## ⚠️ License & Usage

This is a **converted derivative model**.

- You must comply with the original model license of gemma-4-E4B-it
- This is **not an official release**
- No additional rights are granted
- Original ownership remains with the base model creator

---

## 🚀 Quick Start (llama.cpp)

```bash
./llama-cli -m gemma-4-E4B-it-Q4_K_M.gguf -p "Explain AI simply"
```
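
On a small CPU host like the 4 vCPU benchmark machine, it can help to set the thread count, output length, and context size explicitly. The following is a minimal sketch, not a tested configuration: `-t`, `-n`, and `-c` are the usual `llama-cli` options for threads, tokens to generate, and context size, while the values `128` and `2048` and the `MODEL` / `THREADS` variable names are illustrative assumptions.

```bash
# Illustrative variant of the quick-start command; the values are assumptions, not tuned settings.
MODEL="gemma-4-E4B-it-Q4_K_M.gguf"

# Use all available cores; fall back to 4 (the benchmark host's vCPU count) if nproc is missing.
THREADS="$(nproc 2>/dev/null || echo 4)"

if [ -x ./llama-cli ] && [ -f "$MODEL" ]; then
  # -t: CPU threads, -n: max tokens to generate, -c: context window size
  ./llama-cli -m "$MODEL" -p "Explain AI simply" -t "$THREADS" -n 128 -c 2048
else
  echo "llama-cli or $MODEL not found in the current directory" >&2
fi
```

Capping `-n` keeps a run short at the roughly 4 tokens/sec generation speed reported above; raise it once the setup is confirmed to work.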