---
license: apache-2.0
base_model:
- google/gemma-4-E4B-it
tags:
- litert-lm
- backup
- redistribution
---

# notabilia/gemma-4-E4B-it-litert-lm

> **Disclaimer:** This repository is an **unofficial backup and redistribution** of the
> [`litert-community/gemma-4-E4B-it-litert-lm`](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm)
> model, which itself is derived from Google's
> [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it).
>
> The files in this repository are **unaltered** from the source. This backup is maintained
> purely for personal archival and availability purposes. I am not affiliated with Google,
> the LiteRT community, or any of the original authors, and claim no credit for the model
> or its development.
>
> For the authoritative source, documentation, and support, please refer to the official
> repositories linked above.

## License and Terms of Use

This repository redistributes files under the [Apache License 2.0](./LICENSE), consistent with the license of the upstream source.

**Important:** By downloading or using any files from this repository, you are also bound by Google's [Gemma Terms of Use](https://ai.google.dev/gemma/terms). Please review those terms before use.

This repository does not grant any rights beyond what is provided by the Apache 2.0 license and Google's Gemma Terms of Use.

---

**Official Model Card:** [litert-community/gemma-4-E4B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm)

**Base Model:** [google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)

The following documentation is reproduced from the official model card for reference.

---

This model card provides the Gemma 4 E4B model in a way that is ready for deployment on Android, iOS, Desktop, IoT and Web. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
This particular Gemma 4 model is small, so it is ideal for on-device use cases. By running this model on device, users can have private access to Generative AI technology without even requiring an internet connection.

These models are provided in the `.litertlm` format for use with the LiteRT-LM framework. LiteRT-LM is a specialized orchestration layer built directly on top of LiteRT, Google’s high-performance multi-platform runtime trusted by millions of Android and edge developers. LiteRT provides the foundational hardware acceleration via XNNPack for CPU and ML Drift for GPU. LiteRT-LM adds the specialized GenAI libraries and APIs, such as KV-cache management, prompt templating, and function calling. This integrated stack is the same technology powering the Google AI Edge Gallery showcase app.

The model file size is 3.66 GB, which includes a text decoder with 2.24 GB of weights and 0.67 GB of embedding parameters. The LiteRT-LM framework always keeps the main weights in memory, while the embedding parameters are memory-mapped, which enables significant working-memory savings on some platforms, as seen in the detailed data below. The vision and audio models are loaded as needed to further reduce memory consumption.

## Try Gemma 4 E4B

| [Android](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) | [iOS](https://apps.apple.com/us/app/google-ai-edge-gallery/id6749645337) | [Desktop](https://ai.google.dev/edge/litert-lm/cli) | [IoT](https://ai.google.dev/edge/litert-lm/cli) | [Web](https://huggingface.co/spaces/tylermullen/Gemma4) |
| :---: | :---: | :---: | :---: | :---: |

## Build with Gemma 4 E4B and LiteRT-LM

Ready to integrate this into your product? Get started [here](https://ai.google.dev/edge/litert-lm/overview).

## Gemma 4 E4B Performance on LiteRT-LM

All benchmarks were taken using 1024 prefill tokens and 256 decode tokens with a context length of 2048 tokens via LiteRT-LM. The model can support up to 32k context length. CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized; during the first run, latency and memory usage may differ. Model size is the size of the file on disk. CPU memory was measured using `rusage::ru_maxrss` on Android, Linux, and Raspberry Pi, `task_vm_info::phys_footprint` on iOS and MacBook, and `process_memory_counters::PrivateUsage` on Windows.

**Android**

*Note: On [supported Android devices](https://developers.google.com/ml-kit), Gemma 4 is available through Android AI Core as [Gemini Nano](https://developer.android.com/ai/gemini-nano#architecture), which is the recommended path for production applications.*

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| S26 Ultra | CPU | 195 | 17.7 | 5.3 | 3654 | 3283 |
| S26 Ultra | GPU | 1,293 | 22.1 | 0.8 | 3654 | 710 |

**🚨 NEW: Android with Speculative Decoding 🚨**

*The numbers in this section include speculative decoding. Speculative decoding is an optimization that accelerates LLMs by using a small, fast "draft" model to quickly predict multiple upcoming tokens, while a larger "target" model then verifies those tokens in parallel. The effectiveness of speculative decoding is task dependent because the "draft" model can more easily predict the correct tokens of some tasks.
The metrics in this section were collected from a variety of sample prompts and grouped into categories by task type. The baseline measurements are an average across all task types. The number of input and output tokens varied across prompts. Note that if you downloaded this model before May 5, 2026, you should re-download it to use speculative decoding. Speculative decoding is available on CPU and GPU on Mobile and Desktop.*

| Device | Backend | Task Type | Speculative Decoding? | Decode (tokens/sec) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- |
| S26 Ultra | CPU | Baseline | No | 17.0 | 2800 |
| S26 Ultra | CPU | Summarize text | Yes | 27.5 | 3116 |
| S26 Ultra | CPU | Code snippet | Yes | 26.2 | 2946 |
| S26 Ultra | CPU | Rewrite tone | Yes | 29.5 | 2922 |
| S26 Ultra | CPU | Free form | Yes | 21.1 | 2962 |
| S26 Ultra | GPU | Baseline | No | 21.9 | 837 |
| S26 Ultra | GPU | Summarize text | Yes | 46.0 | 1069 |
| S26 Ultra | GPU | Code snippet | Yes | 49.4 | 1021 |
| S26 Ultra | GPU | Rewrite tone | Yes | 47.5 | 999 |
| S26 Ultra | GPU | Free form | Yes | 36.7 | 991 |

**iOS**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU/GPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| iPhone 17 Pro | CPU | 159 | 9.7 | 6.5 | 3654 | 961 |
| iPhone 17 Pro | GPU | 1,189 | 25.1 | 0.9 | 3654 | 3380 |

**Linux**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Arm 2.3 & 2.8GHz | CPU | 82 | 17.5 | 12.6 | 3654 | 3139 |
| NVIDIA GeForce RTX 4090 | GPU | 7,260 | 91.2 | 0.2 | 3654 | 1119 |

**macOS**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU/GPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| MacBook Pro M4 Max | CPU | 277 | 27.0 | 3.7 | 3654 | 890 |
| MacBook Pro M4 Max | GPU | 2,560 | 101.1 | 0.4 | 3654 | 3217 |

**Windows**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Intel LunarLake | CPU | 173 | 16.8 | 5.98 | 3654 | 9372 |
| Intel LunarLake | GPU | 1,202 | 25.13 | 0.89 | 3654 | 7147 |

**IoT**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Raspberry Pi 5 16GB | CPU | 51 | 3.2 | 20.5 | 3654 | 3069 |

## Gemma 4 E4B on Web

Running Gemma inference on the web is currently supported through [LLM Inference Engine](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js) and uses the *gemma-4-E4B-it-web.task* model file. Try it out [live in your browser](https://huggingface.co/spaces/tylermullen/Gemma4) (Chrome with WebGPU recommended). To start developing with it, download [the web model](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/blob/main/gemma-4-E4B-it-web.task) and run with our [sample web page](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/js/README.md), or follow the [guide](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js) to add it to your own app.

Benchmarked in Chrome on a MacBook Pro 2024 (Apple M4 Max) with 1024 prefill tokens and 256 decode tokens, but the model can support context lengths up to 128K.

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Initialization time (sec) | Model size (MB) | CPU Memory (GB) | GPU Memory (GB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Web | GPU | 1,598 | 44.4 | 1.5 | 2964 | 1.1 | 3.3 |

* GPU memory was measured as the "GPU Process" memory for all of Chrome while running. It was 130 MB when inactive, before any model loading took place.
* CPU memory was measured for the entire tab while running. It was 55 MB when inactive, before any model loading took place.
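
As a rough way to read the LiteRT-LM benchmark tables earlier in this card, the sketch below recomputes two derived figures from numbers copied out of those tables: the decode speedup from speculative decoding on the S26 Ultra GPU relative to its baseline, and a naive time-to-first-token estimate (prefill tokens divided by prefill throughput). The function names are hypothetical, and the time-to-first-token formula is a simplifying assumption for illustration, not a description of how LiteRT-LM measures it.

```python
# Decode throughput (tokens/sec) copied from the S26 Ultra GPU rows of the
# speculative decoding table in this card.
BASELINE_DECODE_TPS = 21.9
SPEC_DECODE_TPS = {
    "Summarize text": 46.0,
    "Code snippet": 49.4,
    "Rewrite tone": 47.5,
    "Free form": 36.7,
}

# Prefill length used for the main benchmark tables.
PREFILL_TOKENS = 1024


def speedup(spec_tps: float, baseline_tps: float) -> float:
    """Ratio of speculative-decoding throughput to baseline throughput."""
    return spec_tps / baseline_tps


def estimated_ttft(prefill_tokens: int, prefill_tps: float) -> float:
    """Naive time-to-first-token estimate in seconds (assumption:
    TTFT is dominated by prefill, so tokens / prefill rate)."""
    return prefill_tokens / prefill_tps


def speedups_by_task() -> dict[str, float]:
    """Speedup per task type, rounded to two decimals."""
    return {
        task: round(speedup(tps, BASELINE_DECODE_TPS), 2)
        for task, tps in SPEC_DECODE_TPS.items()
    }


if __name__ == "__main__":
    for task, ratio in speedups_by_task().items():
        print(f"{task}: {ratio:.2f}x")
    # S26 Ultra CPU row: 195 tokens/sec prefill, reported TTFT of 5.3 s.
    print(f"Estimated TTFT: {estimated_ttft(PREFILL_TOKENS, 195):.2f} s")
```

On these numbers the speedup ranges from roughly 1.7x (free form) to roughly 2.3x (code snippet), and the naive estimate for the S26 Ultra CPU row comes out near the reported 5.3 seconds. The same helpers can be pointed at any other row by swapping in its values.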