---
license: apache-2.0
language:
- en
- zh
tags:
- vision-language
- multimodal
- heterogeneous
- neural-architecture-search
base_model: Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
---

# MOSAIC-4B

**MOSAIC-4B** is an efficient heterogeneous Vision-Language Model derived from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) via the **MOSAIC** (**M**ulti-**O**bjective **S**earch for **A**daptive **I**nter-layer **C**omposition) method. MOSAIC automatically transforms homogeneous transformer architectures into optimized heterogeneous designs through hardware-aware neural architecture search.

> **Paper:** *MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models* (arXiv preprint)
>
> **Authors:** Yuncheng Yang\*, Feiyang Ye\*, Shixian Luo, Yinna Zhu, Lianlei Shan, Wangcai Zhao, Kuo Zhang, Yan Chen, Yong Wu†, Xie Yan
>
> **Affiliation:** LiAuto Inc.

---

## Highlights

![MOSAIC-4B vs baselines: average performance vs decoding acceleration](figure1.png)

| Metric | Value |
|--------|-------|
| **Decoding speedup (TPOT)** | **2.54×** vs. Qwen3-VL-4B-Instruct |
| **Prefilling speedup (TTFT @ 96k tokens)** | **1.76×** vs. Qwen3-VL-4B-Instruct |
| **Performance gap (19 benchmarks avg)** | **−0.6%** on image, **−0.8%** on video |
| **Training cost** | **< 2%** of original Qwen3-VL-4B-Instruct |

### Key Advantages

- **Hardware-aware automatic architecture search.** MOSAIC formulates per-layer operator selection as a multi-objective Mixed Integer Programming (MIP) problem, maximizing downstream performance under strict hardware latency constraints, with no manual trial-and-error needed.
- **Heterogeneous operator mixing.** Each of the 36 transformer layers can independently use full attention (GQA), sliding window attention (SWA), linear attention (KDA / GDN), or low-rank attention (MLA). This fine-grained flexibility reaches the optimal performance-efficiency frontier that hand-designed fixed-ratio patterns cannot.
- **Matches teacher performance at a fraction of the training cost.** MOSAIC-4B closely tracks Qwen3-VL-4B-Instruct on image understanding (avg Δ = −0.6%) and video understanding (avg Δ = −0.8%) across 19 representative benchmarks while using only ~32M publicly available training samples, which is less than 2% of the original model's training compute.
- **Scalable inference acceleration.** The speedup grows with sequence length: the TPOT speedup reaches 2.54× at 1k decode length, 2.68× at 16k, and 2.72× at 256k tokens, making MOSAIC-4B especially efficient for long-context and long-generation workloads.
- **Principled two-stage parameter recovery.** Structural transitions are stabilized via (1) global off-policy distillation to align internal representations, followed by (2) dual-teacher on-policy distillation using a 235B oracle teacher for knowledge expansion alongside the original 4B teacher for distributional stability.

---

## Architecture

The figure below shows the per-layer operator assignment and relative runtime reduction for MOSAIC-4B (1.5× speedup target). Green bars indicate runtime saved relative to the original full-attention layer.

![MOSAIC-4B per-layer architecture: operator assignment and runtime reduction](arch_1.5x.png)

---

## Installation

```bash
pip install transformers torch
pip install flash-linear-attention  # required for the KDA, GDN, and MLA operators
```

---

## Usage

This model uses a custom architecture and requires `trust_remote_code=True`.

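
Because the custom modeling code is downloaded and executed at load time, it can be useful to confirm first that the packages from the Installation step are present. The following is a minimal sketch, not part of the released code: `check_environment` and `parse_version` are hypothetical helpers, and the minimum versions are copied from the Dependencies table in this card.

```python
import re
from importlib import metadata

# Minimum versions copied from this card's Dependencies table.
# None means the card gives no minimum (it only says "latest").
REQUIRED = {
    "transformers": (4, 57, 0),
    "torch": (2, 0),
    "flash-linear-attention": None,
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def check_environment(required=REQUIRED, get_version=metadata.version):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, minimum in required.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if minimum is not None and parse_version(installed) < minimum:
            wanted = ".".join(str(part) for part in minimum)
            problems.append(f"{name} {installed} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

An empty result only means the listed packages are installed at acceptable versions; it does not verify GPU support or that the checkpoint itself loads. The loading example follows.
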
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "LiAuto-DSR/MOSAIC-4B"

# The custom architecture ships with the checkpoint, hence trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Tokenize the prompt and load the image in one call so that the pixel inputs
# are passed to the model together with the text tokens.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

---

## Dependencies

| Package | Version |
|---------|---------|
| transformers | ≥ 4.57.0 |
| torch | ≥ 2.0 |
| flash-linear-attention (fla) | latest |

---

## Citation

```bibtex
@article{yang2026mosaic,
  title   = {MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models},
  author  = {Yang, Yuncheng and Ye, Feiyang and Luo, Shixian and Zhu, Yinna and Shan, Lianlei and Zhao, Wangcai and Zhang, Kuo and Chen, Yan and Wu, Yong and Yan, Xie},
  journal = {arXiv preprint},
  year    = {2026}
}
```

---

## License

This model is released under the **Apache 2.0** license. The base model weights are derived from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct), which is licensed under the [Qwen Research License](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct/blob/main/LICENSE).
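
---

## Appendix: Toy Operator-Selection Sketch

The per-layer operator selection described under Key Advantages can be illustrated with a toy example. This is a sketch under loud assumptions, not the MOSAIC method itself: the scores, latencies, and budget below are invented for illustration, the objective is reduced to a single score sum with one latency constraint, and the problem is solved with a small dynamic program (a multiple-choice knapsack) instead of a MIP solver. `select_operators` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def select_operators(
    scores: List[Dict[str, float]],
    latency: Dict[str, int],
    budget: int,
) -> Tuple[float, List[str]]:
    """Pick one operator per layer, maximizing total score with total latency <= budget.

    scores[i][op] is a made-up quality score for using `op` in layer i;
    latency[op] is a made-up integer cost per layer for that operator.
    """
    # best[t] = (total score, chosen operators) over the layers processed so far,
    # using exactly t latency units.
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for layer_scores in scores:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (total, chosen) in best.items():
            for op, score in layer_scores.items():
                t = used + latency[op]
                if t > budget:
                    continue  # this choice would exceed the latency budget
                candidate = (total + score, chosen + [op])
                if t not in nxt or candidate[0] > nxt[t][0]:
                    nxt[t] = candidate
        best = nxt
    if not best:
        raise ValueError("no assignment fits the latency budget")
    return max(best.values(), key=lambda item: item[0])


# Hypothetical inputs: four layers, three candidate operators.
toy_scores = [
    {"GQA": 1.00, "SWA": 0.90, "KDA": 0.70},
    {"GQA": 1.00, "SWA": 0.98, "KDA": 0.95},
    {"GQA": 1.00, "SWA": 0.60, "KDA": 0.50},
    {"GQA": 1.00, "SWA": 0.97, "KDA": 0.96},
]
toy_latency = {"GQA": 4, "SWA": 3, "KDA": 2}
toy_budget = 11  # an all-GQA stack would cost 16 units

total_score, assignment = select_operators(toy_scores, toy_latency, toy_budget)
print(assignment, round(total_score, 2))
```

In this toy setup the layer that loses the most quality when its operator is swapped keeps full attention, while layers that tolerate a cheaper operator give up latency first. That is the intuition behind per-layer assignment as opposed to a fixed-ratio pattern.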