MOSAIC-4B
MOSAIC-4B is an efficient heterogeneous Vision-Language Model derived from Qwen3-VL-4B-Instruct via the MOSAIC (Multi-Objective Search for Adaptive Inter-layer Composition) method. MOSAIC automatically transforms homogeneous transformer architectures into optimized heterogeneous designs through hardware-aware neural architecture search.
Paper: MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models (arXiv preprint)
Authors: Yuncheng Yang*, Feiyang Ye*, Shixian Luo, Yinna Zhu, Lianlei Shan, Wangcai Zhao, Kuo Zhang, Yan Chen, Yong Wu†, Xie Yan (LiAuto Inc.)
Highlights
| Metric | Value |
|---|---|
| Decoding speedup (TPOT) | 2.54× vs. Qwen3-VL-4B-Instruct |
| Prefilling speedup (TTFT @ 96k tokens) | 1.76× vs. Qwen3-VL-4B-Instruct |
| Performance gap (19 benchmarks avg) | −0.6% on image, −0.8% on video |
| Training cost | < 2% of original Qwen3-VL-4B-Instruct |
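A speedup factor can also be read as a latency reduction: a speedup of s means the same work takes 1/s of the baseline time, so the fraction of time saved is 1 − 1/s. The short sketch below applies that arithmetic to the two speedups in the table; the helper name `latency_reduction` is illustrative and not part of the model's code.

```python
def latency_reduction(speedup: float) -> float:
    """Fraction of baseline latency saved for a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedups taken from the Highlights table.
reported = {"TPOT (decoding)": 2.54, "TTFT @ 96k tokens (prefilling)": 1.76}

for name, s in reported.items():
    print(f"{name}: {s}x speedup = {latency_reduction(s):.1%} less time")
```

Under this reading, the 2.54× decoding speedup corresponds to roughly 61% less time per output token, and the 1.76× prefilling speedup to roughly 43% less time to first token.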
Key Advantages
Hardware-aware automatic architecture search. MOSAIC formulates per-layer operator selection as a multi-objective Mixed Integer Programming (MIP) problem, maximizing downstream performance under strict hardware latency constraints — no manual trial-and-error needed.
Heterogeneous operator mixing. Each of the 36 transformer layers can independently use full attention (GQA), sliding window attention (SWA), linear attention (KDA / GDN), or low-rank attention (MLA). This per-layer flexibility lets the search reach performance-efficiency trade-offs that hand-designed fixed-ratio patterns cannot express.
Near-teacher performance at a fraction of the training cost. MOSAIC-4B stays within 1% of Qwen3-VL-4B-Instruct on average across 19 representative benchmarks (avg Δ = −0.6% on image understanding, −0.8% on video understanding) while using only ~32M publicly available training samples, less than 2% of the original model's training compute.
Scalable inference acceleration. The speedup grows with sequence length: the TPOT speedup is 2.54× at a 1k decode length, 2.68× at 16k, and 2.72× at 256k tokens, which makes MOSAIC-4B especially efficient for long-context and long-generation workloads.
Principled two-stage parameter recovery. Structural transitions are stabilized via (1) global off-policy distillation to align internal representations, followed by (2) dual-teacher on-policy distillation using a 235B oracle teacher for knowledge expansion alongside the original 4B teacher for distributional stability.
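The dual-teacher idea in the second stage can be illustrated with a toy per-token loss. This card does not give the exact objective, so everything below is an assumption made for illustration: the loss is taken to be a weighted sum of two KL divergences from the student distribution to each teacher distribution, and the weight `alpha` and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dual_teacher_loss(student_logits, oracle_logits, base_logits, alpha=0.5):
    """Assumed objective: alpha * KL(student || oracle) + (1 - alpha) * KL(student || base).

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    s = softmax(student_logits)
    oracle_term = kl(s, softmax(oracle_logits))  # pull toward the larger oracle teacher
    base_term = kl(s, softmax(base_logits))      # stay close to the original teacher
    return alpha * oracle_term + (1 - alpha) * base_term


# Toy logits over a 3-token vocabulary for a single position.
student = [2.0, 0.5, -1.0]
oracle = [2.5, 0.0, -1.5]
base = [1.8, 0.7, -0.9]
print(f"toy dual-teacher loss: {dual_teacher_loss(student, oracle, base):.4f}")
```

In an on-policy setting the logits would be evaluated on sequences sampled from the student itself; the toy above only shows how the two teacher terms could be combined.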
Architecture
The figure below shows the per-layer operator assignment and relative runtime reduction for MOSAIC-4B (1.5× speedup target). Green bars indicate saved runtime compared to the original full-attention layer.
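A per-layer assignment of this kind is the output of a constrained selection problem: pick one operator per layer to maximize an estimated quality score while the summed latency stays under a budget. The sketch below solves a toy instance with a small dynamic program rather than a MIP solver, which is a deliberate simplification. The scores, latencies, budget, and the function name `select_operators` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# (estimated quality score, latency in integer time units); all values are made up.
Option = Tuple[float, int]


def select_operators(layers: List[Dict[str, Option]], budget: int) -> Tuple[float, List[str]]:
    """Pick one operator per layer to maximize total score with total latency <= budget."""
    # best[t] = (score, picks) for the best assignment of the layers processed so far
    # whose total latency is exactly t.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for options in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (score, picks) in best.items():
            for name, (gain, cost) in options.items():
                total = used + cost
                if total > budget:
                    continue  # this choice would violate the latency constraint
                candidate = (score + gain, picks + [name])
                if total not in nxt or candidate[0] > nxt[total][0]:
                    nxt[total] = candidate
        best = nxt
    if not best:
        raise ValueError("no assignment fits the latency budget")
    return max(best.values(), key=lambda entry: entry[0])


# Toy 3-layer instance: the first layer is assumed to be more sensitive to replacement.
toy_layers = [
    {"GQA": (1.00, 10), "SWA": (0.90, 6), "KDA": (0.80, 4)},
    {"GQA": (1.00, 10), "SWA": (0.98, 6), "KDA": (0.95, 4)},
    {"GQA": (1.00, 10), "SWA": (0.97, 6), "KDA": (0.96, 4)},
]

score, picks = select_operators(toy_layers, budget=20)
print(picks, round(score, 2))
```

In this toy instance the sensitive first layer keeps full attention and the cheaper operators go to the layers that lose the least from them, which is the qualitative behavior a hardware-aware search is meant to produce. A real 36-layer search with measured latencies would normally be handed to a MIP solver.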
Installation
pip install "transformers>=4.57.0" "torch>=2.0"
pip install flash-linear-attention  # required for the KDA, GDN, and MLA operators
Usage
This model uses a custom architecture and requires trust_remote_code=True.
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "LiAuto-DSR/MOSAIC-4B"

# The custom architecture code is loaded from the model repository.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Tokenize through the chat template so the image in `messages` is loaded and
# preprocessed together with the text; calling the processor on text alone would
# drop the image. This assumes the processor follows the Qwen3-VL interface.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
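To compare latency on your own hardware, the two metrics quoted in this card can be computed from token arrival timestamps: TTFT (time to first token) covers prefilling, and TPOT (time per output token) covers decoding. The helper below is a minimal sketch of those definitions. The name `ttft_and_tpot` is hypothetical, and the measurement protocol used for the reported numbers is not described in this card.

```python
from typing import List, Tuple


def ttft_and_tpot(request_time: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds.

    request_time: timestamp at which the request was issued.
    token_times:  arrival timestamp of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_time
    # TPOT averages the gaps between consecutive output tokens after the first.
    n_gaps = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / n_gaps if n_gaps else 0.0
    return ttft, tpot


# Synthetic timestamps: first token after 0.5 s, then one token every 0.1 s.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.6, 0.7, 0.8])
print(f"TTFT = {ttft:.3f} s, TPOT = {tpot:.3f} s")
```

With a real model, the timestamps could be collected with `time.perf_counter()` inside a streaming callback; a speedup is then the baseline metric divided by the MOSAIC-4B metric under the same prompt and decode length.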
Dependencies
| Package | Version |
|---|---|
| transformers | ≥ 4.57.0 |
| torch | ≥ 2.0 |
| flash-linear-attention (fla) | latest |
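The minimum versions in the table can be checked before loading the model. The sketch below reads installed versions with the standard-library `importlib.metadata` and compares only the leading numeric components, which is a simplification (the `packaging` library handles pre-release ordering more rigorously). The helper names are illustrative.

```python
from importlib import metadata

# Minimum versions from the Dependencies table.
REQUIRED = {"transformers": "4.57.0", "torch": "2.0"}


def parse_version(text):
    """Parse leading numeric components, e.g. '2.3.1+cu121' -> (2, 3, 1)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at non-numeric segments such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum (numeric parts only)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Print a status line for each required package."""
    for package, minimum in REQUIRED.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            print(f"{package}: not installed (need >= {minimum})")
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
        print(f"{package} {installed}: {status}")


if __name__ == "__main__":
    check_environment()
```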
Citation
@article{yang2026mosaic,
title = {MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models},
author = {Yang, Yuncheng and Ye, Feiyang and Luo, Shixian and Zhu, Yinna and Shan, Lianlei and Zhao, Wangcai and Zhang, Kuo and Chen, Yan and Wu, Yong and Yan, Xie},
journal = {arXiv preprint},
year = {2026}
}
License
This model is released under the Apache 2.0 license. The base model weights are derived from Qwen3-VL-4B-Instruct, which is licensed under the Qwen Research License.
