	---
	license: apache-2.0
	base_model:
	- Qwen/Qwen3.5-0.8B
	pipeline_tag: image-text-to-text
	library_name: litert-lm
	tags:
	- Qwen3.5
	- litert
	- litert-lm
	- tflite
	- on-device
	- hybrid-attention
	- GatedDeltaNet
	- multimodal
	- vision
	---

	# Qwen3.5-0.8B LiteRT (Multimodal)

	This repository contains a [LiteRT](https://ai.google.dev/edge/litert) (formerly TFLite) conversion of [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) for on-device inference, packaged in the [LiteRT-LM](https://github.com/nicfv/litert-torch) `.litertlm` format. It includes the full multimodal pipeline for image understanding: the language model, the vision encoder, and the vision adapter.

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base Model | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) |
	| Architecture | Hybrid attention (GatedDeltaNet + Full Attention) + ViT vision encoder |
	| Parameters | 752M (language) + 675M (vision encoder) + 10M (vision adapter) |
	| Quantization | Dynamic INT8 |
	| KV Cache Length | 2048 |
	| Prefill Signatures | 64, 128, 256, 512 |
	| Vision Signatures | 256, 576, 1024, 2304 patches |
	| Format | `.litertlm` (LiteRT-LM container) |

	## Architecture

	### Language Model

	Qwen3.5-0.8B uses a hybrid attention architecture that combines:

	- 18 GatedDeltaNet layers (linear attention with recurrent delta rule) at positions 0-2, 4-6, 8-10, 12-14, 16-18, 20-22
	- 6 Full Attention layers (standard multi-head attention with output gating and partial RoPE) at positions 3, 7, 11, 15, 19, 23
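The positions listed above follow a period-4 pattern in which every fourth layer is full attention. A minimal sketch that reproduces the stated positions; the function name `layer_types` and the string labels are illustrative only and are not part of the model's API.

```python
def layer_types(num_layers: int = 24, full_attention_every: int = 4) -> list[str]:
    """Return the attention type per layer index for the pattern described above."""
    return [
        "full_attention" if (i + 1) % full_attention_every == 0 else "gated_delta_net"
        for i in range(num_layers)
    ]


layout = layer_types()
full_positions = [i for i, t in enumerate(layout) if t == "full_attention"]
print(full_positions)                    # [3, 7, 11, 15, 19, 23]
print(layout.count("gated_delta_net"))   # 18
```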

	### Vision Encoder

	The vision encoder is a 27-layer Vision Transformer (ViT):

	- Patch embedding: Conv3d (3→1152, kernel=[2,16,16]) with learned position embeddings (bilinear interpolation from 48×48 grid)
	- 27 VisionBlocks: LayerNorm → Self-Attention (16 heads, head_dim=72, 2D rotary pos emb) → MLP (1152→4304→1152, GELU)
	- Patch merger (vision adapter): Groups 4 adjacent patches (spatial_merge_size=2) and projects to language model dimension (4608→1024)
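The shape arithmetic of the patch merger can be sketched with NumPy. This is an illustration under assumptions, not the exported model's code: it assumes patches are laid out row-major on the grid (the real pipeline may order them differently), and a single random matrix `proj` stands in for the adapter's projection purely to show the 4608→1024 shapes.

```python
import numpy as np

HIDDEN = 1152   # ViT hidden size
MERGE = 2       # spatial_merge_size
LM_DIM = 1024   # language model dimension


def merge_patches(patches: np.ndarray, grid_h: int, grid_w: int) -> np.ndarray:
    """Group each 2x2 block of neighboring patches into one 4608-dim vector."""
    x = patches.reshape(grid_h // MERGE, MERGE, grid_w // MERGE, MERGE, HIDDEN)
    x = x.transpose(0, 2, 1, 3, 4)  # (h/2, w/2, 2, 2, hidden)
    return x.reshape(-1, MERGE * MERGE * HIDDEN)


rng = np.random.default_rng(0)
patches = rng.standard_normal((16 * 16, HIDDEN)).astype(np.float32)  # 256 patches
merged = merge_patches(patches, 16, 16)

# Hypothetical projection weights, random here only to demonstrate shapes.
proj = rng.standard_normal((MERGE * MERGE * HIDDEN, LM_DIM)).astype(np.float32)
tokens = merged @ proj
print(merged.shape, tokens.shape)  # (64, 4608) (64, 1024)
```

With a 16×16 patch grid, 256 patches become 64 merged tokens, which matches the 4:1 ratio between encoder patches and adapter tokens.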

	The model was re-authored from scratch using the LiteRT Generative API. The vision encoder and adapter are exported as separate TFLite models bundled alongside the language model.

	## Files

	| File | Size | Description |
	|------|------|-------------|
	| `qwen35_mm_q8_ekv2048.litertlm` | ~1.2 GB | LiteRT-LM bundle (LM + vision encoder + vision adapter + tokenizer) |
	| `qwen35_mm_q8_ekv2048.tflite` | ~757 MB | Language model TFLite |
	| `qwen35_vision_encoder_q8.tflite` | ~88 MB | Vision encoder TFLite |
	| `qwen35_vision_adapter_q8.tflite` | ~12 MB | Vision adapter TFLite |
	| `qwen35_embedder_q8.tflite` | ~245 MB | Text embedder TFLite |
	| `tokenizer.json` | ~11 MB | HuggingFace tokenizer |
	| `tokenizer_config.json` | ~2 KB | Tokenizer configuration |

	## Signatures

	### Language Model

	| Signature | Input Length | Outputs |
	|-----------|-------------|---------|
	| `prefill_64` | 64 tokens | Updated KV cache |
	| `prefill_128` | 128 tokens | Updated KV cache |
	| `prefill_256` | 256 tokens | Updated KV cache |
	| `prefill_512` | 512 tokens | Updated KV cache |
	| `decode` | 1 token | Logits + Updated KV cache |
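Because the prefill signatures have fixed lengths, a caller has to choose one and pad the prompt to it. The sketch below shows one way to do that; the helper names and the `pad_id=0` default are assumptions for illustration (the correct pad token depends on the tokenizer), and prompts longer than 512 tokens would need to be split across several prefill calls, which is not shown.

```python
import numpy as np

PREFILL_LENGTHS = (64, 128, 256, 512)


def pick_prefill_signature(num_tokens: int) -> tuple[str, int]:
    """Return the smallest prefill signature that fits num_tokens."""
    for length in PREFILL_LENGTHS:
        if num_tokens <= length:
            return f"prefill_{length}", length
    raise ValueError(f"{num_tokens} tokens exceed the largest prefill signature")


def pad_tokens(token_ids: list[int], length: int, pad_id: int = 0) -> np.ndarray:
    """Right-pad token ids to the signature length, shape (1, length)."""
    padded = np.full((1, length), pad_id, dtype=np.int32)
    padded[0, : len(token_ids)] = token_ids
    return padded


name, length = pick_prefill_signature(100)
tokens = pad_tokens(list(range(1, 101)), length)
print(name, tokens.shape)  # prefill_128 (1, 128)
```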

	### Vision Encoder

	| Signature | Patches | Approx. Image Size |
	|-----------|---------|---------------------|
	| `encode_256` | 256 | 256×256 |
	| `encode_576` | 576 | 384×384 |
	| `encode_1024` | 1024 | 512×512 |
	| `encode_2304` | 2304 | 768×768 |

	### Vision Adapter

	| Signature | Merged Tokens | From Patches |
	|-----------|---------------|--------------|
	| `adapt_64` | 64 | 256 |
	| `adapt_144` | 144 | 576 |
	| `adapt_256` | 256 | 1024 |
	| `adapt_576` | 576 | 2304 |
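The encoder and adapter signature names are linked by simple arithmetic: a 16×16-pixel patch grid gives the patch count, and merging 2×2 patches divides it by four. A minimal sketch, assuming the image has already been resized to one of the supported square sizes; `vision_signatures` is a hypothetical helper, not something shipped in this repository.

```python
PATCH_SIZE = 16
MERGE_SIZE = 2
ENCODE_PATCHES = (256, 576, 1024, 2304)


def vision_signatures(height: int, width: int) -> tuple[str, str]:
    """Map an image size to the encoder and adapter signature names."""
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError("image sides must be multiples of the patch size")
    patches = (height // PATCH_SIZE) * (width // PATCH_SIZE)
    if patches not in ENCODE_PATCHES:
        raise ValueError(f"no exported signature for {patches} patches")
    merged = patches // (MERGE_SIZE * MERGE_SIZE)
    return f"encode_{patches}", f"adapt_{merged}"


for side in (256, 384, 512, 768):
    print(side, vision_signatures(side, side))
```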

	## Usage

	### Python (ai-edge-litert)

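	```python
	import numpy as np
	from ai_edge_litert import interpreter as tfl_interpreter

	# Load the language model
	interp = tfl_interpreter.Interpreter(model_path="qwen35_mm_q8_ekv2048.tflite")
	interp.allocate_tensors()

	# Initialize KV cache (24 layers, mixed shapes)
	kv_cache = {}  # See inference_tflite.py for full initialization

	# Prefill: returns the updated KV cache only (no logits)
	prefill_runner = interp.get_signature_runner("prefill_64")
	tokens = np.array([[...]], dtype=np.int32)  # Padded to 64
	input_pos = np.arange(64, dtype=np.int32)
	kv_cache = prefill_runner(tokens=tokens, input_pos=input_pos, **kv_cache)

	# Decode loop (greedy). Before the loop, set next_token and pos to the token
	# and position where decoding starts (shapes (1, 1) and (1,) are assumed here);
	# see inference_tflite.py.
	decode_runner = interp.get_signature_runner("decode")
	max_tokens = 64  # Example generation budget
	for step in range(max_tokens):
	    output = decode_runner(tokens=next_token, input_pos=pos, **kv_cache)
	    logits = output.pop("logits")
	    kv_cache = output  # Assumes the remaining outputs are the updated cache
	    next_token = np.array([[np.argmax(logits[0, -1])]], dtype=np.int32)
	    pos = pos + 1
	```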

	### Tokenizer

	```python
	from transformers import AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained("g-ntovas/Qwen3.5-0.8B-LiteRT")
	```

	## Conversion Details

	- Source: [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) (multimodal model)
	- Method: Custom re-authoring using LiteRT Generative API
	- Quantization: Dynamic INT8 (`dynamic_int8`)
	- Export: Per-signature tracing with fixed prefill lengths and patch counts
	- Vision: Encoder and adapter exported as separate TFLite models, bundled into `.litertlm`

	## Limitations

	- Video input is not yet supported (the encoder architecture supports it, but the data processor returns UNIMPLEMENTED for video)
	- Prompts are padded up to the next prefill signature length (64, 128, 256, or 512), which may introduce minor quality differences for the linear attention layers
	- The recurrent GatedDeltaNet implementation may produce slightly different outputs compared to the chunk-based HuggingFace implementation due to floating-point operation ordering
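The last point comes down to floating-point addition not being associative: accumulating the same values in a different order can change the low-order bits. The snippet below is a generic illustration of that effect and is unrelated to the model's actual kernels; the array size and chunk width are arbitrary.

```python
import numpy as np

# Same three numbers, two groupings, different float64 results.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# Sequential (recurrent-style) accumulation, one element at a time.
sequential = np.float32(0.0)
for value in x:
    sequential = np.float32(sequential + value)

# Chunked accumulation: sum 64-element chunks, then sum the partial results.
chunked = x.reshape(64, 64).sum(axis=1, dtype=np.float32).sum(dtype=np.float32)

# The two results agree closely but are not guaranteed to be bit-identical.
print(sequential, chunked, abs(sequential - chunked))
```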

	## License

	This model inherits the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) from the original [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) model.

	## Citation

	If you use this model, please cite the original Qwen3.5 paper:

	```bibtex
	@misc{qwen3.5,
	  title={Qwen3.5 Technical Report},
	  author={Qwen Team},
	  year={2026},
	  url={https://huggingface.co/Qwen/Qwen3.5-0.8B}
	}
	```