LiteRT-LM

litert-community/gemma-4-12B-it-litert-lm

Main Model Card: google/gemma-4-12B-it

This model card provides the Gemma 4 12B model in LiteRT-LM format, ready for deployment on macOS and Linux. Please check back here regularly for updates on wider platform support and further functionality improvements. The current LiteRT-LM version supports text and audio modalities; image and multi-token prediction support will be available in a future update.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. This particular Gemma 4 model is medium-sized, which makes it well suited to desktop use cases. By running this model on device, users get private access to generative AI technology without requiring an internet connection.

These models are provided in the .litertlm format for use with the LiteRT-LM framework. LiteRT-LM is a specialized orchestration layer built directly on top of LiteRT, Google’s high-performance multi-platform runtime trusted by millions of Android and edge developers. LiteRT provides the foundational hardware acceleration via XNNPack for CPU and ML Drift for GPU. LiteRT-LM adds the specialized GenAI libraries and APIs, such as KV-cache management, prompt templating, and function calling. This integrated stack is the same technology powering the Google AI Edge Gallery showcase app.

Try Gemma 4 12B

Build with Gemma 4 12B and LiteRT-LM

Ready to integrate this into your product? Get started with the LiteRT-LM documentation.

Gemma 4 12B Performance on LiteRT-LM

All benchmarks were taken using 1024 prefill tokens and 256 decode tokens with a context length of 2048 tokens via LiteRT-LM. The model supports a context length of up to 32k tokens. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized; latency and memory usage may differ during the first run. Model size is the size of the file on disk.

The Linux GPU memory usage number was measured with --disable_cache=true --convert_weights_on_gpu=false. Without these flags, it is about 6 GB higher.

Linux

Device                    Backend  Prefill (tokens/sec)  Decode (tokens/sec)  Time-to-first-token (sec)  Model size (MB)  GPU Memory (MB)
AMD Radeon™ AI PRO R9700  GPU      662.32                66.26                1.56                       6235             8064.2
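
The note on GPU memory above can be turned into a rough headroom check. The following is a minimal sketch, not part of LiteRT-LM: the function name and the example memory budgets are hypothetical, and "about 6GB" is assumed to mean 6 * 1024 MB.

```python
# Rough GPU memory headroom check based on the Linux figures in this card.
# Assumption: "about 6GB higher" is treated as 6 * 1024 MB.

MEASURED_GPU_MB = 8064.2          # measured with --disable_cache=true --convert_weights_on_gpu=false
EXTRA_WITHOUT_FLAGS_MB = 6 * 1024  # approximate extra usage without those flags


def estimated_gpu_mb(with_measurement_flags: bool) -> float:
    """Return the approximate GPU memory needed, in MB."""
    if with_measurement_flags:
        return MEASURED_GPU_MB
    return MEASURED_GPU_MB + EXTRA_WITHOUT_FLAGS_MB


def fits(gpu_budget_mb: float, with_measurement_flags: bool) -> bool:
    """Check whether the estimate fits a given (hypothetical) memory budget."""
    return estimated_gpu_mb(with_measurement_flags) <= gpu_budget_mb


if __name__ == "__main__":
    for budget_gb in (12, 16):
        budget_mb = budget_gb * 1024
        for flags in (True, False):
            label = "with flags" if flags else "without flags"
            print(f"{budget_gb} GB budget, {label}: fits={fits(budget_mb, flags)}")
```

The estimate covers only the benchmark configuration described above (2048-token context); longer contexts are likely to need more memory.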

macOS

Device          Backend  Prefill (tokens/sec)  Decode (tokens/sec)  Time-to-first-token (sec)  Model size (MB)  GPU Memory (MB)
MacBook Pro M4  GPU      243.55                29.56                4.2                        6235             7763
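
The throughput figures in the tables above can be combined into a back-of-the-envelope latency estimate. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes prefill and decode rates stay constant regardless of prompt length, which real runs will not follow exactly. As a sanity check, dividing the 1024 benchmark prefill tokens by the prefill rate gives roughly 1.55 s on the Radeon and 4.20 s on the MacBook Pro, close to the reported time-to-first-token values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    device: str
    prefill_tps: float  # prefill tokens/sec from the table
    decode_tps: float   # decode tokens/sec from the table
    ttft_s: float       # reported time-to-first-token in seconds


# Values copied from the benchmark tables in this card.
ROWS = [
    BenchmarkRow("AMD Radeon AI PRO R9700", 662.32, 66.26, 1.56),
    BenchmarkRow("MacBook Pro M4", 243.55, 29.56, 4.2),
]


def estimate_latency_s(row: BenchmarkRow, prompt_tokens: int, output_tokens: int) -> float:
    """Estimate total generation time, assuming constant throughput.

    Model load time is excluded, matching the benchmark methodology.
    """
    prefill_s = prompt_tokens / row.prefill_tps
    decode_s = output_tokens / row.decode_tps
    return prefill_s + decode_s


if __name__ == "__main__":
    for row in ROWS:
        prefill_only = 1024 / row.prefill_tps
        total = estimate_latency_s(row, prompt_tokens=1024, output_tokens=256)
        print(
            f"{row.device}: prefill ~{prefill_only:.2f}s "
            f"(reported TTFT {row.ttft_s}s), total ~{total:.2f}s"
        )
```

For other prompt and response lengths, pass different token counts to `estimate_latency_s`; treat the result as an order-of-magnitude guide rather than a measured figure.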