| license: apache-2.0 | |
| base_model: | |
| - Qwen/Qwen3.5-0.8B | |
| pipeline_tag: image-text-to-text | |
| library_name: litert-lm | |
| tags: | |
| - Qwen3.5 | |
| - litert | |
| - litert-lm | |
| - tflite | |
| - on-device | |
| - hybrid-attention | |
| - GatedDeltaNet | |
| - multimodal | |
| - vision | |
| # Qwen3.5-0.8B LiteRT (Multimodal) | |
| This repository contains a [LiteRT](https://ai.google.dev/edge/litert) (formerly TFLite) conversion of [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) for on-device inference, packaged in the [LiteRT-LM](https://github.com/nicfv/litert-torch) `.litertlm` format. It includes the **full multimodal pipeline**: language model, vision encoder, and vision adapter for image understanding. | |
| ## Model Details | |
| | Property | Value | | |
| |----------|-------| | |
| | **Base Model** | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) | | |
| | **Architecture** | Hybrid attention (GatedDeltaNet + Full Attention) + ViT vision encoder | | |
| | **Parameters** | 752M (language) + 675M (vision encoder) + 10M (vision adapter) | | |
| | **Quantization** | Dynamic INT8 | | |
| | **KV Cache Length** | 2048 | | |
| | **Prefill Signatures** | 64, 128, 256, 512 | | |
| | **Vision Signatures** | 256, 576, 1024, 2304 patches | | |
| | **Format** | `.litertlm` (LiteRT-LM container) | | |
| ## Architecture | |
| ### Language Model | |
| Qwen3.5-0.8B uses a **hybrid attention** architecture that combines: | |
| - **18 GatedDeltaNet layers** (linear attention with recurrent delta rule) at positions 0-2, 4-6, 8-10, 12-14, 16-18, 20-22 | |
| - **6 Full Attention layers** (standard multi-head attention with output gating and partial RoPE) at positions 3, 7, 11, 15, 19, 23 | |
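The layer positions listed above follow a repeating pattern: three GatedDeltaNet layers, then one full-attention layer. The sketch below reproduces that layout; the function name `layer_types` and the parameter `full_attention_interval` are illustrative and not taken from the model's configuration.

```python
def layer_types(num_layers: int = 24, full_attention_interval: int = 4) -> list[str]:
    """Return the attention type for each 0-based layer index."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "gated_delta_net"
        for i in range(num_layers)
    ]


types = layer_types()
full = [i for i, t in enumerate(types) if t == "full_attention"]
linear = [i for i, t in enumerate(types) if t == "gated_delta_net"]

print(full)         # [3, 7, 11, 15, 19, 23]
print(len(linear))  # 18
```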
| ### Vision Encoder | |
| The vision encoder is a 27-layer Vision Transformer (ViT): | |
| - **Patch embedding**: Conv3d (3→1152, kernel=[2,16,16]) with learned position embeddings (bilinear interpolation from 48×48 grid) | |
| - **27 VisionBlocks**: LayerNorm → Self-Attention (16 heads, head_dim=72, 2D rotary pos emb) → MLP (1152→4304→1152, GELU) | |
| - **Patch merger** (vision adapter): Groups 4 adjacent patches (spatial_merge_size=2) and projects to language model dimension (4608→1024) | |
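The shape arithmetic of the patch merger can be checked with a short NumPy sketch. It assumes the patch features arrive ordered so that every four consecutive rows form one 2×2 spatial block; that ordering is an assumption here, and `merge_patches` is a hypothetical helper, not part of this repository. The learned 4608→1024 projection is not shown.

```python
import numpy as np

HIDDEN_SIZE = 1152  # 16 heads * head_dim 72


def merge_patches(patch_features: np.ndarray, spatial_merge_size: int = 2) -> np.ndarray:
    """Concatenate each group of spatial_merge_size**2 patch vectors into one row."""
    group = spatial_merge_size ** 2
    num_patches, hidden = patch_features.shape
    if num_patches % group != 0:
        raise ValueError(f"{num_patches} patches cannot be split into groups of {group}")
    return patch_features.reshape(num_patches // group, hidden * group)


features = np.zeros((256, HIDDEN_SIZE), dtype=np.float32)
merged = merge_patches(features)
print(merged.shape)  # (64, 4608)
```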
| The model was **re-authored from scratch** using the LiteRT Generative API. The vision encoder and adapter are exported as separate TFLite models bundled alongside the language model. | |
| ## Files | |
| | File | Size | Description | | |
| |------|------|-------------| | |
| | `qwen35_mm_q8_ekv2048.litertlm` | ~1.2 GB | LiteRT-LM bundle (LM + vision encoder + vision adapter + tokenizer) | | |
| | `qwen35_mm_q8_ekv2048.tflite` | ~757 MB | Language model TFLite | | |
| | `qwen35_vision_encoder_q8.tflite` | ~88 MB | Vision encoder TFLite | | |
| | `qwen35_vision_adapter_q8.tflite` | ~12 MB | Vision adapter TFLite | | |
| | `qwen35_embedder_q8.tflite` | ~245 MB | Text embedder TFLite | | |
| | `tokenizer.json` | ~11 MB | HuggingFace tokenizer | | |
| | `tokenizer_config.json` | ~2 KB | Tokenizer configuration | | |
| ## Signatures | |
| ### Language Model | |
| | Signature | Input Length | Outputs | | |
| |-----------|-------------|---------| | |
| | `prefill_64` | 64 tokens | Updated KV cache | | |
| | `prefill_128` | 128 tokens | Updated KV cache | | |
| | `prefill_256` | 256 tokens | Updated KV cache | | |
| | `prefill_512` | 512 tokens | Updated KV cache | | |
| | `decode` | 1 token | Logits + Updated KV cache | | |
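Because the prefill signatures have fixed lengths, a prompt has to be padded to one of them before it is fed to the model. The sketch below picks the smallest signature that fits and pads the token IDs. The helper names and the padding ID `0` are placeholders for illustration; the actual padding convention is not stated in this card. Prompts longer than 512 tokens would need to be split across several prefill calls, which is not shown.

```python
import numpy as np

PREFILL_LENGTHS = (64, 128, 256, 512)


def pick_prefill_length(num_tokens: int) -> int:
    """Return the smallest prefill length that can hold num_tokens."""
    for length in PREFILL_LENGTHS:
        if num_tokens <= length:
            return length
    raise ValueError(f"{num_tokens} tokens exceed the largest prefill signature")


def pad_prompt(token_ids: list[int], pad_id: int = 0):
    """Pad token_ids and return (signature name, tokens, input_pos)."""
    length = pick_prefill_length(len(token_ids))
    tokens = np.full((1, length), pad_id, dtype=np.int32)
    tokens[0, : len(token_ids)] = token_ids
    input_pos = np.arange(length, dtype=np.int32)
    return f"prefill_{length}", tokens, input_pos


name, tokens, input_pos = pad_prompt(list(range(1, 101)))  # 100 dummy token IDs
print(name, tokens.shape)  # prefill_128 (1, 128)
```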
| ### Vision Encoder | |
| | Signature | Patches | Approx. Image Size | | |
| |-----------|---------|---------------------| | |
| | `encode_256` | 256 | 256×256 | | |
| | `encode_576` | 576 | 384×384 | | |
| | `encode_1024` | 1024 | 512×512 | | |
| | `encode_2304` | 2304 | 768×768 | | |
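The patch counts in this table follow from the 16×16 spatial patch size given in the Architecture section. A minimal sketch of the mapping from image size to encoder signature is shown below; `patch_count` and `encode_signature` are hypothetical helpers, and whether non-square inputs with a matching patch count are accepted is not covered here.

```python
PATCH_SIZE = 16
ENCODE_PATCH_COUNTS = (256, 576, 1024, 2304)


def patch_count(height: int, width: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patches for an image whose sides are multiples of patch_size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


def encode_signature(height: int, width: int) -> str:
    """Name of the exported encoder signature matching this image size."""
    count = patch_count(height, width)
    if count not in ENCODE_PATCH_COUNTS:
        raise ValueError(f"no encoder signature exported for {count} patches")
    return f"encode_{count}"


print(encode_signature(256, 256))  # encode_256
print(encode_signature(384, 384))  # encode_576
print(encode_signature(768, 768))  # encode_2304
```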
| ### Vision Adapter | |
| | Signature | Merged Tokens | From Patches | | |
| |-----------|---------------|--------------| | |
| | `adapt_64` | 64 | 256 | | |
| | `adapt_144` | 144 | 576 | | |
| | `adapt_256` | 256 | 1024 | | |
| | `adapt_576` | 576 | 2304 | | |
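Each adapter signature pairs with one encoder signature: the merged token count is the patch count divided by `spatial_merge_size` squared. The helper below is an illustrative sketch of that pairing, not code from this repository.

```python
def adapter_signature(num_patches: int, spatial_merge_size: int = 2) -> str:
    """Name of the adapter signature that consumes num_patches encoder outputs."""
    group = spatial_merge_size ** 2
    if num_patches % group != 0:
        raise ValueError(f"{num_patches} is not divisible by {group}")
    return f"adapt_{num_patches // group}"


pairs = {n: adapter_signature(n) for n in (256, 576, 1024, 2304)}
print(pairs)
# {256: 'adapt_64', 576: 'adapt_144', 1024: 'adapt_256', 2304: 'adapt_576'}
```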
| ## Usage | |
| ### Python (ai-edge-litert) | |
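| ```python | |
| import numpy as np | |
| from ai_edge_litert import interpreter as tfl_interpreter | |
| # Load the language model | |
| interp = tfl_interpreter.Interpreter(model_path="qwen35_mm_q8_ekv2048.tflite") | |
| interp.allocate_tensors() | |
| # Initialize KV cache (24 layers, mixed shapes) | |
| kv_cache = {} # See inference_tflite.py for full initialization | |
| # Prefill | |
| prefill_runner = interp.get_signature_runner("prefill_64") | |
| tokens = np.array([[...]], dtype=np.int32) # Padded to 64 | |
| input_pos = np.arange(64, dtype=np.int32) | |
| output = prefill_runner(tokens=tokens, input_pos=input_pos, **kv_cache) | |
| # Carry the updated cache forward (assumes cache outputs reuse the cache input names) | |
| kv_cache = {name: value for name, value in output.items() if name != "logits"} | |
| # Greedy decode loop | |
| decode_runner = interp.get_signature_runner("decode") | |
| max_tokens = 128 # Example generation budget | |
| next_token = np.array([[...]], dtype=np.int32) # First decode input, shape (1, 1) | |
| pos = np.array([...], dtype=np.int32) # Position of that token | |
| generated = [] | |
| for step in range(max_tokens): | |
|     output = decode_runner(tokens=next_token, input_pos=pos, **kv_cache) | |
|     token_id = int(np.argmax(output["logits"][0, -1])) | |
|     generated.append(token_id) | |
|     kv_cache = {name: value for name, value in output.items() if name != "logits"} | |
|     next_token = np.array([[token_id]], dtype=np.int32) # Feed the new token back in | |
|     pos = pos + 1 | |
| ``` | |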
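The control flow of the decode loop can be tried without the model file by swapping in a stand-in for the `decode` signature. In the sketch below, `decode_step` is a hypothetical callable that would wrap the signature runner and its cache handling, `fake_decode_step` is a toy replacement, and the end-of-sequence ID is made up. The 2048 limit comes from the KV cache length listed under Model Details.

```python
import numpy as np

KV_CACHE_LENGTH = 2048


def greedy_decode(decode_step, first_token, start_pos, max_new_tokens, eos_id):
    """Greedy loop: stop on EOS, on the token budget, or when the cache is full."""
    generated = []
    token, pos = first_token, start_pos
    for _ in range(max_new_tokens):
        if pos >= KV_CACHE_LENGTH:
            break  # no cache slot left for this position
        logits = decode_step(token, pos)
        token = int(np.argmax(logits))
        if token == eos_id:
            break
        generated.append(token)
        pos += 1
    return generated


def fake_decode_step(token, pos, vocab_size=8):
    """Toy stand-in for the model: always predicts (token + 1) % vocab_size."""
    logits = np.zeros(vocab_size, dtype=np.float32)
    logits[(token + 1) % vocab_size] = 1.0
    return logits


result = greedy_decode(fake_decode_step, first_token=2, start_pos=5,
                       max_new_tokens=10, eos_id=7)
print(result)  # [3, 4, 5, 6]
```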
| ### Tokenizer | |
| ```python | |
| from transformers import AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("GabrieleConte/Qwen3.5-0.8B-LiteRT") | |
| ``` | |
| ## Conversion Details | |
| - **Source**: [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) (multimodal model) | |
| - **Method**: Custom re-authoring using LiteRT Generative API | |
| - **Quantization**: Dynamic INT8 (`dynamic_int8`) | |
| - **Export**: Per-signature tracing with fixed prefill lengths and patch counts | |
| - **Vision**: Encoder and adapter exported as separate TFLite models, bundled into `.litertlm` | |
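As a rough illustration of what INT8 weight storage involves, the sketch below applies symmetric per-output-channel quantization to a small weight matrix and measures the round-trip error. This is a generic technique shown under assumptions; the exact scheme applied by the converter's `dynamic_int8` recipe (granularity, which layers are quantized) is not described in this card. In dynamic quantization, activations are typically quantized at runtime rather than stored.

```python
import numpy as np


def quantize_weights_int8(weights: np.ndarray):
    """Symmetric per-row quantization of a (out_features, in_features) matrix."""
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    quantized = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return quantized, scales


def dequantize(quantized: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float32 weights from int8 values and per-row scales."""
    return quantized.astype(np.float32) * scales


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, s = quantize_weights_int8(w)
error = float(np.max(np.abs(dequantize(q, s) - w)))
print(q.dtype, error)  # int8 and a small round-trip error (at most half a scale step)
```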
| ## Limitations | |
| - Video input is not yet supported (encoder architecture supports it, but the data processor returns UNIMPLEMENTED for video) | |
| - Prompts are padded up to the next prefill signature length, which may introduce minor quality differences for the linear attention layers | |
| - The recurrent GatedDeltaNet implementation may produce slightly different outputs compared to the chunk-based HuggingFace implementation due to floating-point operation ordering | |
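To make the last point concrete, here is a minimal NumPy sketch of a recurrent gated delta rule step as described in the Gated DeltaNet literature. It is not the implementation used in this repository and omits normalization, multiple heads, and learned gates; `alpha` and `beta` are fixed constants chosen for illustration. Running the same recurrence in float32 and float64 shows how rounding accumulates in the state, which is the same effect that a different operation order produces.

```python
import numpy as np


def gated_delta_step(state, q, k, v, alpha, beta):
    """One recurrent step. state: (d_k, d_v); q, k: (d_k,); v: (d_v,)."""
    state = alpha * state                                 # gated decay of the memory
    predicted = state.T @ k                               # value the memory returns for key k
    state = state + np.outer(k, beta * (v - predicted))   # delta-rule correction
    return state, state.T @ q                             # new state and output for query q


def run(dtype, steps=64, d_k=8, d_v=8):
    """Run the recurrence on fixed random inputs in the given precision."""
    rng = np.random.default_rng(0)
    state = np.zeros((d_k, d_v), dtype=dtype)
    out = np.zeros(d_v, dtype=dtype)
    for _ in range(steps):
        q, k = rng.standard_normal((2, d_k))
        k = k / np.linalg.norm(k)
        v = rng.standard_normal(d_v)
        state, out = gated_delta_step(
            state, q.astype(dtype), k.astype(dtype), v.astype(dtype),
            dtype(0.95), dtype(0.5),
        )
    return out


diff = float(np.max(np.abs(run(np.float64) - run(np.float32).astype(np.float64))))
print(diff)  # small, typically nonzero
```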
| ## License | |
| This model inherits the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) from the original [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) model. | |
| ## Citation | |
| If you use this model, please cite the original Qwen3.5 paper: | |
| ```bibtex | |
| @misc{qwen3.5, | |
| title={Qwen3.5 Technical Report}, | |
| author={Qwen Team}, | |
| year={2026}, | |
| url={https://huggingface.co/Qwen/Qwen3.5-0.8B} | |
| } | |
| ``` | |