MOSAIC-4B

MOSAIC-4B is an efficient heterogeneous Vision-Language Model derived from Qwen3-VL-4B-Instruct via the MOSAIC (Multi-Objective Search for Adaptive Inter-layer Composition) method. MOSAIC automatically transforms homogeneous transformer architectures into optimized heterogeneous designs through hardware-aware neural architecture search.

Paper: MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models (arXiv preprint)

Authors: Yuncheng Yang*, Feiyang Ye*, Shixian Luo, Yinna Zhu, Lianlei Shan, Wangcai Zhao, Kuo Zhang, Yan Chen, Yong Wu†, Xie Yan (LiAuto Inc.)


Highlights

[Figure: MOSAIC-4B vs. baselines, average performance vs. decoding acceleration]

Metric | Value
Decoding speedup (TPOT) | 2.54× vs. Qwen3-VL-4B-Instruct
Prefilling speedup (TTFT @ 96k tokens) | 1.76× vs. Qwen3-VL-4B-Instruct
Performance gap (average over 19 benchmarks) | −0.6% on image, −0.8% on video
Training cost | < 2% of original Qwen3-VL-4B-Instruct
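
The two latency metrics above follow the usual definitions: TTFT (time to first token) measures prefill latency, and TPOT (time per output token) measures the average per-token decoding latency after the first token. The sketch below shows how such speedup ratios are conventionally derived from wall-clock timings. The function names and all timing numbers are illustrative assumptions, not measurements or tooling from the MOSAIC paper.

```python
def tpot(total_seconds: float, ttft_seconds: float, num_output_tokens: int) -> float:
    """Average time per output token, excluding the first token (prefill)."""
    if num_output_tokens < 2:
        raise ValueError("need at least two output tokens to measure TPOT")
    return (total_seconds - ttft_seconds) / (num_output_tokens - 1)


def speedup(baseline_latency: float, candidate_latency: float) -> float:
    """Ratio above 1 means the candidate is faster than the baseline."""
    return baseline_latency / candidate_latency


# Hypothetical timings (seconds) for a 1,001-token generation.
baseline_tpot = tpot(total_seconds=21.0, ttft_seconds=1.0, num_output_tokens=1001)
candidate_tpot = tpot(total_seconds=11.0, ttft_seconds=1.0, num_output_tokens=1001)
print(f"TPOT speedup: {speedup(baseline_tpot, candidate_tpot):.2f}x")
```

The same `speedup` ratio applies to TTFT: divide the baseline's time to first token by the candidate's at a fixed prompt length.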

Key Advantages

  • Hardware-aware automatic architecture search. MOSAIC formulates per-layer operator selection as a multi-objective Mixed Integer Programming (MIP) problem, maximizing downstream performance under strict hardware latency constraints — no manual trial-and-error needed.

  • Heterogeneous operator mixing. Each of the 36 transformer layers can independently use full attention (GQA), sliding window attention (SWA), linear attention (KDA / GDN), or low-rank attention (MLA). This per-layer flexibility reaches points on the performance-efficiency frontier that hand-designed fixed-ratio patterns cannot.

  • Matches teacher performance at a fraction of the training cost. MOSAIC-4B stays within 1% of Qwen3-VL-4B-Instruct on image understanding (avg Δ = −0.6%) and video understanding (avg Δ = −0.8%) across 19 representative benchmarks, while training on only ~32M publicly available samples, which amounts to less than 2% of the original model's training compute.

  • Scalable inference acceleration. The speedup grows with sequence length: the TPOT speedup over Qwen3-VL-4B-Instruct is 2.54× at 1k decode length, 2.68× at 16k, and 2.72× at 256k tokens, making MOSAIC-4B especially efficient for long-context and long-generation workloads.

  • Principled two-stage parameter recovery. Structural transitions are stabilized via (1) global off-policy distillation to align internal representations, followed by (2) dual-teacher on-policy distillation using a 235B oracle teacher for knowledge expansion alongside the original 4B teacher for distributional stability.
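
To make the per-layer operator selection concrete, here is a deliberately simplified sketch. It treats the problem as a single-objective multiple-choice knapsack (pick exactly one operator per layer, maximize an additive quality score, stay within a latency budget) and solves it exactly with dynamic programming. The additive score, the integer latencies, and every number below are hypothetical; the actual MOSAIC formulation is a multi-objective MIP whose details are given in the paper, not here.

```python
from typing import Dict, List, Tuple

# Hypothetical per-operator estimates: name -> (quality score, latency in integer units).
# Real values would come from hardware profiling and per-layer sensitivity analysis.
Candidates = Dict[str, Tuple[float, int]]


def select_operators(layers: List[Candidates], latency_budget: int) -> Tuple[float, List[str]]:
    """Pick one operator per layer to maximize total score within the budget.

    best[t] holds the highest-scoring (score, choices) whose total latency is t.
    """
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for candidates in layers:
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (score, chosen) in best.items():
            for name, (gain, cost) in candidates.items():
                total = used + cost
                if total > latency_budget:
                    continue  # this choice would exceed the latency budget
                option = (score + gain, chosen + [name])
                if total not in nxt or option[0] > nxt[total][0]:
                    nxt[total] = option
        best = nxt
    if not best:
        raise ValueError("no assignment satisfies the latency budget")
    return max(best.values(), key=lambda item: item[0])


# Toy 3-layer model: layer 2 is sensitive to replacement, layer 3 is not.
toy_layers = [
    {"GQA": (1.00, 10), "SWA": (0.97, 6), "GDN": (0.90, 4)},
    {"GQA": (1.00, 10), "SWA": (0.80, 6), "GDN": (0.70, 4)},
    {"GQA": (1.00, 10), "SWA": (0.99, 6), "GDN": (0.98, 4)},
]
score, assignment = select_operators(toy_layers, latency_budget=20)
print(assignment, round(score, 2))
```

In this toy instance the solver keeps full attention only in the sensitive middle layer and spends the remaining budget on cheaper operators elsewhere, which is the kind of non-uniform pattern a fixed-ratio design cannot express.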


Architecture

The figure below shows the per-layer operator assignment and relative runtime reduction for MOSAIC-4B (1.5× speedup target). Green bars indicate saved runtime compared to the original full-attention layer.

[Figure: MOSAIC-4B per-layer architecture, operator assignment and runtime reduction]


Installation

pip install transformers torch
pip install flash-linear-attention  # required for the KDA, GDN, and MLA operators

Usage

This model uses a custom architecture and requires trust_remote_code=True.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "LiAuto-DSR/MOSAIC-4B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

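# Tokenize the chat and load the referenced image in one call, so the
# image pixels are passed to the model along with the text tokens.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)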

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

response = processor.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Dependencies

Package | Version
transformers | ≥ 4.57.0
torch | ≥ 2.0
flash-linear-attention (fla) | latest
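
A quick way to confirm an environment meets these minimums is sketched below, using only the standard library. The helper names are hypothetical, the version parsing is intentionally simple (it compares only the leading numeric release and ignores pre-release ordering), and the distribution name `flash-linear-attention` is taken from the install command above.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions from the table above; None means any installed version is accepted.
REQUIREMENTS: Dict[str, Optional[str]] = {
    "transformers": "4.57.0",
    "torch": "2.0",
    "flash-linear-attention": None,
}


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release, e.g. '2.3.1+cu121' -> (2, 3, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # stop at suffixes such as 'rc1' or 'dev0'
            break
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric releases, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def check_environment() -> Dict[str, str]:
    """Return a human-readable status string for each required package."""
    report = {}
    for package, minimum in REQUIREMENTS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        ok = minimum is None or meets_minimum(installed, minimum)
        status = "ok" if ok else f"too old, need >= {minimum}"
        report[package] = f"{installed} ({status})"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```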

Citation

@article{yang2026mosaic,
  title     = {MOSAIC: Adaptive Inter-layer Composition for Efficient Heterogeneous Vision-Language Models},
  author    = {Yang, Yuncheng and Ye, Feiyang and Luo, Shixian and Zhu, Yinna and Shan, Lianlei and Zhao, Wangcai and Zhang, Kuo and Chen, Yan and Wu, Yong and Yan, Xie},
  journal   = {arXiv preprint},
  year      = {2026}
}

License

This model is released under the Apache 2.0 license. The base model weights are derived from Qwen3-VL-4B-Instruct, which is licensed under the Qwen Research License.
