---
license: apache-2.0
language:
  - en
  - zh
tags:
  - vision-language
  - multimodal
  - heterogeneous
  - neural-architecture-search
base_model: Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
---

# MOSAIC-4B

**MOSAIC-4B** is an efficient heterogeneous Vision-Language Model derived from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) via the **MOSAIC** (**M**ulti-**O**bjective **S**earch for **A**daptive **I**nter-layer **C**omposition) method. MOSAIC automatically transforms homogeneous transformer architectures into optimized heterogeneous designs through hardware-aware neural architecture search.

> **Paper:** *MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models* (arXiv preprint)
> **Authors:** Yuncheng Yang\*, Feiyang Ye\*, Shixian Luo, Yinna Zhu, Lianlei Shan, Wangcai Zhao, Kuo Zhang, Yan Chen, Yong Wu†, Xie Yan — LiAuto Inc.

---

## Highlights

![MOSAIC-4B vs baselines: average performance vs decoding acceleration](figure1.png)

| Metric | Value |
|--------|-------|
| **Decoding speedup (TPOT)** | **2.54×** vs. Qwen3-VL-4B-Instruct |
| **Prefilling speedup (TTFT @ 96k tokens)** | **1.76×** vs. Qwen3-VL-4B-Instruct |
| **Performance gap (19 benchmarks avg)** | **−0.6%** on image, **−0.8%** on video |
| **Training cost** | **< 2%** of original Qwen3-VL-4B-Instruct |

### Key Advantages

- **Hardware-aware automatic architecture search.** MOSAIC formulates per-layer operator selection as a multi-objective Mixed Integer Programming (MIP) problem, maximizing downstream performance under strict hardware latency constraints — no manual trial-and-error needed.

- **Heterogeneous operator mixing.** Each of the 36 transformer layers can independently use full attention (GQA), sliding window attention (SWA), linear attention (KDA / GDN), or low-rank attention (MLA). This per-layer flexibility reaches points on the performance-efficiency frontier that hand-designed fixed-ratio patterns cannot.

- **Matches teacher performance at a fraction of the training cost.** MOSAIC-4B stays within 1% of Qwen3-VL-4B-Instruct on image understanding (avg Δ = −0.6%) and video understanding (avg Δ = −0.8%) across 19 representative benchmarks. It is trained on only ~32M publicly available samples, which is less than 2% of the original model's training compute.

- **Scalable inference acceleration.** The speedup grows with sequence length: TPOT reaches 2.54× at 1k decode length, 2.68× at 16k, and 2.72× at 256k tokens, making MOSAIC-4B especially efficient for long-context and long-generation workloads.

- **Principled two-stage parameter recovery.** Structural transitions are stabilized via (1) global off-policy distillation to align internal representations, followed by (2) dual-teacher on-policy distillation using a 235B oracle teacher for knowledge expansion alongside the original 4B teacher for distributional stability.
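
The dual-teacher idea in the last bullet can be illustrated with a toy per-token loss. This is a minimal sketch, not the training code: the reverse-KL form, the single mixing weight `oracle_weight`, and the function names are all assumptions introduced here for illustration. The card only states that an oracle teacher and the original 4B teacher are used together.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dual_teacher_loss(student_logits, oracle_logits, base_logits, oracle_weight=0.5):
    """Weighted sum of KL(student || teacher) over two teachers.

    Hypothetical form: oracle_weight balances the large oracle teacher
    against the original teacher; the real objective may differ.
    """
    student = softmax(student_logits)
    oracle_term = kl_divergence(student, softmax(oracle_logits))
    base_term = kl_divergence(student, softmax(base_logits))
    return oracle_weight * oracle_term + (1.0 - oracle_weight) * base_term


if __name__ == "__main__":
    # One token position, vocabulary of three entries (made-up logits).
    print(dual_teacher_loss([1.0, 2.0, 3.0], [0.5, 2.0, 3.5], [1.0, 2.0, 3.0]))
```

In this sketch the loss is zero only when the student agrees with both teachers, and shifting `oracle_weight` toward 0 keeps the student closer to the original model's distribution.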

---

## Architecture

The figure below shows the per-layer operator assignment and relative runtime reduction for MOSAIC-4B (1.5× speedup target). Green bars indicate saved runtime compared to the original full-attention layer.

![MOSAIC-4B per-layer architecture: operator assignment and runtime reduction](arch_1.5x.png)
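
A per-layer assignment like the one in the figure is the solution of an operator-selection problem: pick one operator per layer to maximize a quality proxy under a latency budget. The toy sketch below shows the shape of that problem only. The scores, the integer latency units, the additive objective, and the dynamic-programming solver are assumptions made for illustration; MOSAIC itself formulates the search as a Mixed Integer Program, and its actual objective and constraints are described in the paper.

```python
def select_operators(layer_scores, latencies, budget):
    """Choose one operator per layer, maximizing total score within a budget.

    layer_scores: list of dicts, one per layer, mapping operator -> score.
    latencies:    dict mapping operator -> integer latency cost.
    budget:       maximum total latency (integer).
    Returns (best_total_score, list_of_chosen_operators).
    """
    # best[used] = (score, assignment) for the layers processed so far
    best = {0: (0.0, [])}
    for scores in layer_scores:
        nxt = {}
        for used, (score, assignment) in best.items():
            for op, cost in latencies.items():
                total = used + cost
                if total > budget:
                    continue
                candidate = score + scores[op]
                if total not in nxt or candidate > nxt[total][0]:
                    nxt[total] = (candidate, assignment + [op])
        if not nxt:
            raise ValueError("no assignment fits within the latency budget")
        best = nxt
    return max(best.values(), key=lambda item: item[0])


if __name__ == "__main__":
    # Made-up numbers: three layers, three candidate operators.
    latencies = {"GQA": 4, "SWA": 2, "GDN": 1}
    layer_scores = [
        {"GQA": 1.0, "SWA": 0.9, "GDN": 0.6},
        {"GQA": 1.0, "SWA": 0.95, "GDN": 0.9},
        {"GQA": 1.0, "SWA": 0.7, "GDN": 0.5},
    ]
    score, assignment = select_operators(layer_scores, latencies, budget=7)
    print(assignment, round(score, 2))
```

With these made-up numbers the budget rules out using full attention everywhere, so the solver keeps it only on the layer where the cheaper operators lose the most.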

---

## Installation

```bash
pip install "transformers>=4.57.0" "torch>=2.0"
pip install flash-linear-attention  # required for the KDA, GDN, and MLA operators
```

---

## Usage

This model uses a custom architecture and requires `trust_remote_code=True`.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "LiAuto-DSR/MOSAIC-4B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

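# Tokenize through the chat template so the image in `messages` is loaded and
# passed to the model together with the text prompt.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)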

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
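
To compare latency against the base model on your own hardware, the two metrics from the Highlights table can be computed from token arrival timestamps. The helper below is a minimal sketch that reads TTFT as time to first token and TPOT as average time per output token after the first (the usual meanings; this card does not spell them out). The function names are hypothetical, and how you collect the timestamps (for example with `time.perf_counter()` around a streaming generate call) is left open.

```python
def latency_metrics(request_time, token_times):
    """Return (ttft, tpot) in the same unit as the inputs.

    request_time: timestamp taken just before generation starts.
    token_times:  timestamps at which each generated token arrived, in order.
    """
    if not token_times:
        raise ValueError("at least one generated token is required")
    ttft = token_times[0] - request_time
    if len(token_times) == 1:
        return ttft, 0.0  # no decode steps after the first token
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def speedup(baseline, candidate):
    """Ratio of baseline latency to candidate latency (higher is faster)."""
    if candidate <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline / candidate


if __name__ == "__main__":
    # Made-up timestamps in seconds, only to show the call pattern.
    ttft, tpot = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
    print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

Measure both models with the same prompt, the same `max_new_tokens`, and a warm-up run first, then pass the two TPOT (or TTFT) values to `speedup`.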

---

## Dependencies

| Package | Version |
|---------|---------|
| transformers | ≥ 4.57.0 |
| torch | ≥ 2.0 |
| flash-linear-attention (fla) | latest |
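
A quick way to check an environment against the table above is sketched below, using only the standard library. The helper names are hypothetical, and the version parsing is deliberately simple (leading numeric components only), so treat it as a convenience check rather than a full version-specifier implementation.

```python
import re
from importlib import metadata

# Minimum versions from the table; flash-linear-attention has no stated minimum.
MINIMUMS = {"transformers": "4.57.0", "torch": "2.0"}


def version_tuple(version):
    """Keep leading numeric components, e.g. '2.3.0+cu121' -> (2, 3, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Map each package name to True/False (False if missing or too old)."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = False
    try:
        metadata.version("flash-linear-attention")
        report["flash-linear-attention"] = True
    except metadata.PackageNotFoundError:
        report["flash-linear-attention"] = False
    return report


if __name__ == "__main__":
    for package, ok in check_environment().items():
        print(f"{package}: {'ok' if ok else 'missing or too old'}")
```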

---

## Citation

```bibtex
@article{yang2026mosaic,
  title     = {MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models},
  author    = {Yang, Yuncheng and Ye, Feiyang and Luo, Shixian and Zhu, Yinna and Shan, Lianlei and Zhao, Wangcai and Zhang, Kuo and Chen, Yan and Wu, Yong and Yan, Xie},
  journal   = {arXiv preprint},
  year      = {2026}
}
```

---

## License

This model is released under the **Apache 2.0** license.
The base model weights are derived from [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct), which is licensed under the [Qwen Research License](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct/blob/main/LICENSE).