---
license: apache-2.0
base_model:
- Qwen/Qwen3.5-0.8B
pipeline_tag: image-text-to-text
library_name: litert-lm
tags:
- Qwen3.5
- litert
- litert-lm
- tflite
- on-device
- hybrid-attention
- GatedDeltaNet
- multimodal
- vision
---

# Qwen3.5-0.8B LiteRT (Multimodal)

This repository contains a [LiteRT](https://ai.google.dev/edge/litert) (formerly TFLite) conversion of [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) for on-device inference, packaged in the [LiteRT-LM](https://github.com/nicfv/litert-torch) `.litertlm` format.

It includes the **full multimodal pipeline**: language model, vision encoder, and vision adapter for image understanding.

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) |
| **Architecture** | Hybrid attention (GatedDeltaNet + Full Attention) + ViT vision encoder |
| **Parameters** | 752M (language) + 675M (vision encoder) + 10M (vision adapter) |
| **Quantization** | Dynamic INT8 |
| **KV Cache Length** | 2048 |
| **Prefill Signatures** | 64, 128, 256, 512 |
| **Vision Signatures** | 256, 576, 1024, 2304 patches |
| **Format** | `.litertlm` (LiteRT-LM container) |

## Architecture

### Language Model

Qwen3.5-0.8B uses a **hybrid attention** architecture that combines:

- **18 GatedDeltaNet layers** (linear attention with recurrent delta rule) at positions 0-2, 4-6, 8-10, 12-14, 16-18, 20-22
- **6 Full Attention layers** (standard multi-head attention with output gating and partial RoPE) at positions 3, 7, 11, 15, 19, 23

### Vision Encoder

The vision encoder is a 27-layer Vision Transformer (ViT):

- **Patch embedding**: Conv3d (3→1152, kernel=[2,16,16]) with learned position embeddings (bilinear interpolation from 48×48 grid)
- **27 VisionBlocks**: LayerNorm → Self-Attention (16 heads, head_dim=72, 2D rotary pos emb) → MLP (1152→4304→1152, GELU)
- **Patch merger** (vision adapter): Groups 4 adjacent patches (spatial_merge_size=2) and projects to language model dimension (4608→1024)

The model was **re-authored from scratch** using the LiteRT Generative API. The vision encoder and adapter are exported as separate TFLite models bundled alongside the language model.

## Files

| File | Size | Description |
|------|------|-------------|
| `qwen35_mm_q8_ekv2048.litertlm` | ~1.2 GB | LiteRT-LM bundle (LM + vision encoder + vision adapter + tokenizer) |
| `qwen35_mm_q8_ekv2048.tflite` | ~757 MB | Language model TFLite |
| `qwen35_vision_encoder_q8.tflite` | ~88 MB | Vision encoder TFLite |
| `qwen35_vision_adapter_q8.tflite` | ~12 MB | Vision adapter TFLite |
| `qwen35_embedder_q8.tflite` | ~245 MB | Text embedder TFLite |
| `tokenizer.json` | ~11 MB | HuggingFace tokenizer |
| `tokenizer_config.json` | ~2 KB | Tokenizer configuration |

## Signatures

### Language Model

| Signature | Input Length | Outputs |
|-----------|-------------|---------|
| `prefill_64` | 64 tokens | Updated KV cache |
| `prefill_128` | 128 tokens | Updated KV cache |
| `prefill_256` | 256 tokens | Updated KV cache |
| `prefill_512` | 512 tokens | Updated KV cache |
| `decode` | 1 token | Logits + Updated KV cache |

### Vision Encoder

| Signature | Patches | Approx. Image Size |
|-----------|---------|---------------------|
| `encode_256` | 256 | 256×256 |
| `encode_576` | 576 | 384×384 |
| `encode_1024` | 1024 | 512×512 |
| `encode_2304` | 2304 | 768×768 |

### Vision Adapter

| Signature | Merged Tokens | From Patches |
|-----------|---------------|--------------|
| `adapt_64` | 64 | 256 |
| `adapt_144` | 144 | 576 |
| `adapt_256` | 256 | 1024 |
| `adapt_576` | 576 | 2304 |

## Usage

### Python (ai-edge-litert)

```python
import numpy as np
from ai_edge_litert import interpreter as tfl_interpreter

# Load the language model
interp = tfl_interpreter.Interpreter(model_path="qwen35_mm_q8_ekv2048.tflite")
interp.allocate_tensors()

# Initialize KV cache (24 layers, mixed shapes)
kv_cache = {}  # See inference_tflite.py for full initialization

# Prefill: feed the padded prompt through a fixed-length signature
prefill_runner = interp.get_signature_runner("prefill_64")
tokens = np.array([[...]], dtype=np.int32)  # Prompt token IDs, padded to 64
input_pos = np.arange(64, dtype=np.int32)
output = prefill_runner(tokens=tokens, input_pos=input_pos, **kv_cache)

# Decode loop (greedy). This is an outline: max_tokens, next_token, and pos
# must be defined by the caller, and kv_cache must be refreshed from the
# updated cache tensors that each call returns.
decode_runner = interp.get_signature_runner("decode")
for step in range(max_tokens):
    output = decode_runner(tokens=next_token, input_pos=pos, **kv_cache)
    next_token = np.argmax(output["logits"][0, -1])
```

### Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("g-ntovas/Qwen3.5-0.8B-LiteRT")
```

## Conversion Details

- **Source**: [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) (multimodal model)
- **Method**: Custom re-authoring using LiteRT Generative API
- **Quantization**: Dynamic INT8 (`dynamic_int8`)
- **Export**: Per-signature tracing with fixed prefill lengths and patch counts
- **Vision**: Encoder and adapter exported as separate TFLite models, bundled into `.litertlm`

## Limitations

- Video input is not yet supported (the encoder architecture supports it, but the data processor returns UNIMPLEMENTED for video)
- Prompts are padded to the nearest prefill signature length, which may introduce minor quality differences for the linear attention layers
- The recurrent GatedDeltaNet implementation may produce slightly different outputs compared to the chunk-based HuggingFace implementation due to floating-point operation ordering

## License

This model inherits the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) from the original [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) model.

## Citation

If you use this model, please cite the original Qwen3.5 paper:

```bibtex
@misc{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2026},
  url={https://huggingface.co/Qwen/Qwen3.5-0.8B}
}
```
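
## Example: Selecting Signatures

Because every signature is traced at a fixed size, callers have to map a prompt length or patch count onto one of the exported names. The sketch below shows one way to do that using only the numbers listed in this README. The helper names (`pick_prefill_signature`, `pad_prompt`, `vision_signatures`) are hypothetical and not part of this repository. It also assumes that "nearest" means the smallest signature that fits the prompt, that padding goes on the right, and that `pad_id=0` is acceptable; check all three against your own pipeline.

```python
import math

# Values copied from the signature tables in this README.
PREFILL_LENGTHS = (64, 128, 256, 512)
PATCH_COUNTS = (256, 576, 1024, 2304)
PATCH_SIZE = 16         # pixels per patch side (Conv3d kernel 16x16)
SPATIAL_MERGE_SIZE = 2  # the adapter merges 2x2 = 4 patches into one token


def pick_prefill_signature(num_tokens):
    """Return (signature_name, padded_length) for a prompt of num_tokens."""
    if num_tokens < 1:
        raise ValueError("prompt must contain at least one token")
    for length in PREFILL_LENGTHS:
        if num_tokens <= length:
            return f"prefill_{length}", length
    raise ValueError(
        f"{num_tokens} tokens exceeds the largest prefill signature "
        f"({PREFILL_LENGTHS[-1]})"
    )


def pad_prompt(token_ids, pad_id=0):
    """Right-pad token_ids to the chosen length (pad_id is an assumption)."""
    name, length = pick_prefill_signature(len(token_ids))
    padded = list(token_ids) + [pad_id] * (length - len(token_ids))
    return name, padded


def vision_signatures(num_patches):
    """Return (encoder_signature, adapter_signature, approx_image_side_px)."""
    if num_patches not in PATCH_COUNTS:
        raise ValueError(f"unsupported patch count: {num_patches}")
    merged_tokens = num_patches // (SPATIAL_MERGE_SIZE ** 2)
    side_px = math.isqrt(num_patches) * PATCH_SIZE  # assumes a square grid
    return f"encode_{num_patches}", f"adapt_{merged_tokens}", side_px
```

For instance, a 70-token prompt maps to `prefill_128`, and a 576-patch image maps to `encode_576` followed by `adapt_144`. Prompts longer than 512 tokens raise an error here; whether they can instead be split across several prefill calls is not covered by this README.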