---
license: apache-2.0
base_model:
- Qwen/Qwen3.5-0.8B
pipeline_tag: image-text-to-text
library_name: litert-lm
tags:
- Qwen3.5
- litert
- litert-lm
- tflite
- on-device
- hybrid-attention
- GatedDeltaNet
- multimodal
- vision
---
# Qwen3.5-0.8B LiteRT (Multimodal)
This repository contains a [LiteRT](https://ai.google.dev/edge/litert) (formerly TFLite) conversion of [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) for on-device inference, packaged in the [LiteRT-LM](https://github.com/nicfv/litert-torch) `.litertlm` format. It includes the **full multimodal pipeline**: language model, vision encoder, and vision adapter for image understanding.
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) |
| **Architecture** | Hybrid attention (GatedDeltaNet + Full Attention) + ViT vision encoder |
| **Parameters** | 752M (language) + 675M (vision encoder) + 10M (vision adapter) |
| **Quantization** | Dynamic INT8 |
| **KV Cache Length** | 2048 |
| **Prefill Signatures** | 64, 128, 256, 512 |
| **Vision Signatures** | 256, 576, 1024, 2304 patches |
| **Format** | `.litertlm` (LiteRT-LM container) |
## Architecture
### Language Model
Qwen3.5-0.8B uses a **hybrid attention** architecture that combines:
- **18 GatedDeltaNet layers** (linear attention with recurrent delta rule) at positions 0-2, 4-6, 8-10, 12-14, 16-18, 20-22
- **6 Full Attention layers** (standard multi-head attention with output gating and partial RoPE) at positions 3, 7, 11, 15, 19, 23
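
The positions listed above follow a regular pattern: every fourth layer is full attention and the rest are GatedDeltaNet. A minimal sketch that reproduces the layout (the function name `layer_types` and its parameters are illustrative, not taken from the model code):

```python
def layer_types(num_layers: int = 24, full_attention_interval: int = 4) -> list[str]:
    """Return the attention type for each layer index (0-based)."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


types = layer_types()
full_positions = [i for i, t in enumerate(types) if t == "full_attention"]

print(full_positions)                  # [3, 7, 11, 15, 19, 23]
print(types.count("gated_deltanet"))   # 18
```
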
### Vision Encoder
The vision encoder is a 27-layer Vision Transformer (ViT):
- **Patch embedding**: Conv3d (3→1152, kernel=[2,16,16]) with learned position embeddings (bilinear interpolation from 48×48 grid)
- **27 VisionBlocks**: LayerNorm → Self-Attention (16 heads, head_dim=72, 2D rotary pos emb) → MLP (1152→4304→1152, GELU)
- **Patch merger** (vision adapter): Groups 4 adjacent patches (spatial_merge_size=2) and projects to language model dimension (4608→1024)

The model was **re-authored from scratch** using the LiteRT Generative API. The vision encoder and adapter are exported as separate TFLite models bundled alongside the language model.
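
The dimensions quoted for the vision encoder are mutually consistent, which a few lines of arithmetic can confirm. This is only a sanity check on the numbers above; the variable names are illustrative and the weight count ignores biases:

```python
# Attention width: heads * head_dim should equal the ViT hidden size.
num_heads, head_dim = 16, 72
hidden_size = num_heads * head_dim

# The patch merger concatenates spatial_merge_size^2 patches before projecting.
spatial_merge_size = 2
merger_input_dim = hidden_size * spatial_merge_size ** 2

# Conv3d patch embedding: in_channels * out_channels * kT * kH * kW weights.
in_channels, kernel = 3, (2, 16, 16)
patch_embed_weights = in_channels * hidden_size * kernel[0] * kernel[1] * kernel[2]

# Learned position embeddings are defined on a 48x48 grid.
pos_embed_entries = 48 * 48

print(hidden_size)          # 1152
print(merger_input_dim)     # 4608
print(patch_embed_weights)  # 1769472
print(pos_embed_entries)    # 2304
```
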
## Files
| File | Size | Description |
|------|------|-------------|
| `qwen35_mm_q8_ekv2048.litertlm` | ~1.2 GB | LiteRT-LM bundle (LM + vision encoder + vision adapter + tokenizer) |
| `qwen35_mm_q8_ekv2048.tflite` | ~757 MB | Language model TFLite |
| `qwen35_vision_encoder_q8.tflite` | ~88 MB | Vision encoder TFLite |
| `qwen35_vision_adapter_q8.tflite` | ~12 MB | Vision adapter TFLite |
| `qwen35_embedder_q8.tflite` | ~245 MB | Text embedder TFLite |
| `tokenizer.json` | ~11 MB | HuggingFace tokenizer |
| `tokenizer_config.json` | ~2 KB | Tokenizer configuration |
## Signatures
### Language Model
| Signature | Input Length | Outputs |
|-----------|-------------|---------|
| `prefill_64` | 64 tokens | Updated KV cache |
| `prefill_128` | 128 tokens | Updated KV cache |
| `prefill_256` | 256 tokens | Updated KV cache |
| `prefill_512` | 512 tokens | Updated KV cache |
| `decode` | 1 token | Logits + Updated KV cache |
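
Because the prefill signatures have fixed lengths, a prompt has to be padded before it is passed to one of them. The sketch below picks the smallest signature that fits and pads with a placeholder ID. The helper names, the choice of `pad_id=0`, and the decision to reject prompts longer than 512 tokens (rather than splitting them into chunks) are all assumptions made for illustration:

```python
PREFILL_LENGTHS = (64, 128, 256, 512)


def pick_prefill_length(num_tokens: int) -> int:
    """Return the smallest prefill signature length that fits the prompt."""
    for length in PREFILL_LENGTHS:
        if num_tokens <= length:
            return length
    raise ValueError(f"prompt of {num_tokens} tokens exceeds the largest prefill signature")


def pad_for_prefill(token_ids: list[int], pad_id: int = 0) -> tuple[list[int], str]:
    """Pad token IDs on the right and return them with the signature name."""
    length = pick_prefill_length(len(token_ids))
    padded = token_ids + [pad_id] * (length - len(token_ids))
    return padded, f"prefill_{length}"


padded, signature = pad_for_prefill(list(range(1, 71)))  # a 70-token prompt
print(signature, len(padded))  # prefill_128 128
```

The padded list can then be wrapped as `np.array([padded], dtype=np.int32)` before it is handed to the signature runner.
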
### Vision Encoder
| Signature | Patches | Approx. Image Size |
|-----------|---------|---------------------|
| `encode_256` | 256 | 256×256 |
| `encode_576` | 576 | 384×384 |
| `encode_1024` | 1024 | 512×512 |
| `encode_2304` | 2304 | 768×768 |
### Vision Adapter
| Signature | Merged Tokens | From Patches |
|-----------|---------------|--------------|
| `adapt_64` | 64 | 256 |
| `adapt_144` | 144 | 576 |
| `adapt_256` | 256 | 1024 |
| `adapt_576` | 576 | 2304 |
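
The encoder and adapter tables are linked by simple arithmetic: a 16×16 patch size gives the patch count, and merging 2×2 patches divides it by four. A minimal sketch, assuming the image has already been resized so both sides are multiples of 16 (the function name is hypothetical, and whether non-square grids with a matching patch count are accepted is not stated here):

```python
PATCH_SIZE = 16
SPATIAL_MERGE_SIZE = 2
VISION_PATCH_COUNTS = (256, 576, 1024, 2304)


def vision_signatures(height: int, width: int) -> tuple[str, str]:
    """Map an image size to the matching encoder and adapter signature names."""
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError("image sides must be multiples of the patch size")
    patches = (height // PATCH_SIZE) * (width // PATCH_SIZE)
    if patches not in VISION_PATCH_COUNTS:
        raise ValueError(f"no exported signature for {patches} patches")
    merged = patches // SPATIAL_MERGE_SIZE ** 2
    return f"encode_{patches}", f"adapt_{merged}"


print(vision_signatures(384, 384))  # ('encode_576', 'adapt_144')
print(vision_signatures(768, 768))  # ('encode_2304', 'adapt_576')
```
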
## Usage
### Python (ai-edge-litert)
```python
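import numpy as np
from ai_edge_litert import interpreter as tfl_interpreter

# Load the language model
interp = tfl_interpreter.Interpreter(model_path="qwen35_mm_q8_ekv2048.tflite")
interp.allocate_tensors()

# Initialize KV cache (24 layers, mixed shapes)
kv_cache = {}  # See inference_tflite.py for full initialization

# Prefill: token IDs padded to the signature length
prefill_runner = interp.get_signature_runner("prefill_64")
tokens = np.array([[...]], dtype=np.int32)  # Padded to 64
input_pos = np.arange(64, dtype=np.int32)
output = prefill_runner(tokens=tokens, input_pos=input_pos, **kv_cache)
kv_cache = dict(output)  # Assumes output names match the cache input names

# Decode loop (greedy)
decode_runner = interp.get_signature_runner("decode")
max_tokens = 32  # Number of tokens to generate
# Set next_token (int32, shape assumed [1, 1]) and pos (int32, shape assumed [1])
# before the loop; see inference_tflite.py.
for step in range(max_tokens):
    output = decode_runner(tokens=next_token, input_pos=pos, **kv_cache)
    next_token = np.array([[np.argmax(output["logits"][0, -1])]], dtype=np.int32)
    kv_cache = {k: v for k, v in output.items() if k != "logits"}
    pos = pos + 1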
```
### Tokenizer
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("g-ntovas/Qwen3.5-0.8B-LiteRT")
```
## Conversion Details
- **Source**: [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) (multimodal model)
- **Method**: Custom re-authoring using LiteRT Generative API
- **Quantization**: Dynamic INT8 (`dynamic_int8`)
- **Export**: Per-signature tracing with fixed prefill lengths and patch counts
- **Vision**: Encoder and adapter exported as separate TFLite models, bundled into `.litertlm`
## Limitations
- Video input is not yet supported: the encoder architecture can handle it, but the data processor returns `UNIMPLEMENTED` for video
- Prompts are padded up to the next prefill signature length, which may introduce minor quality differences for the linear attention layers
- The recurrent GatedDeltaNet implementation may produce slightly different outputs compared to the chunk-based HuggingFace implementation due to floating-point operation ordering
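
The last point comes down to floating-point addition not being associative: adding the same terms in a different order can round differently. The toy float32 example below illustrates the effect only; it is unrelated to the actual GatedDeltaNet kernels, and the `f32` helper is a stand-in that rounds through `struct`:

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


a, b, c = f32(1e8), f32(-1e8), f32(1.0)

# Same three terms, two evaluation orders, each step rounded to float32.
left_to_right = f32(f32(a + b) + c)  # (a + b) + c
right_to_left = f32(a + f32(b + c))  # a + (b + c)

print(left_to_right, right_to_left)  # 1.0 0.0
```

In a real model the discrepancies are far smaller than in this contrived case, but they accumulate across layers and decoding steps.
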
## License
This model inherits the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) from the original [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) model.
## Citation
If you use this model, please cite the original Qwen3.5 paper:
```bibtex
@misc{qwen3.5,
title={Qwen3.5 Technical Report},
author={Qwen Team},
year={2026},
url={https://huggingface.co/Qwen/Qwen3.5-0.8B}
}
```