---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

<p align="center">
   <img src="assets/logo.png" width="320"/>
</p>

# MOSS-VL-Base-0408

## 📌 Introduction

MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing open multimodal foundation models.

Trained with four stages of multimodal pretraining and no post-training, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation. The four pretraining stages are:

1. Stage 1: Vision-language alignment
2. Stage 2: Large-scale multimodal pretraining
3. Stage 3: High-quality multimodal pretraining
4. Stage 4: Annealing and long-context extension

### ✨ Highlights

- 🎬 **Native Video Understanding Foundation** — Supports native video frame sequence processing for temporal perception and general multimodal representation learning.
- 🖼️ **Strong General Multimodal Perception** — Covers single-image, multi-image, and mixed-modality offline understanding workloads.
- 🧱 **Robust Base for Adaptation** — Serves as the pretrained backbone for future SFT, alignment, and task-specific adaptation.

### 📝 Note on Variants

> [!IMPORTANT]
> **This is the base checkpoint.** It has **NOT** undergone supervised fine-tuning (SFT) or instruction tuning, and it is not the streaming variant. If you are looking for a user-facing instruction-following model, please refer to the corresponding instruct release.

---

## 🏗 Model Architecture

**MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a flexible multimodal backbone for image and video understanding while preserving a clean foundation for downstream alignment and adaptation.

<p align="center">
    <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
</p>

## 🧩 Absolute Timestamps

To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage.

<p align="center">
    <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
</p>
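As a rough illustration of what an absolute timestamp is for uniformly sampled frames, the sketch below computes the time in seconds of each sampled frame from a clip duration and a sampling rate. The helper `sample_timestamps` and its clamping rule are assumptions made for illustration; the parameter names `video_fps`, `min_frames`, and `max_frames` mirror the Quickstart arguments, but the processor's actual sampling logic may differ.

```python
def sample_timestamps(duration_s, video_fps=1.0, min_frames=1, max_frames=256):
    """Return absolute timestamps (seconds) for uniformly sampled frames.

    Hypothetical helper: the released processor may sample differently.
    """
    if duration_s < 0 or video_fps <= 0:
        raise ValueError("duration_s must be >= 0 and video_fps must be > 0")
    # Frame count at the requested rate, clamped to [min_frames, max_frames].
    target = int(duration_s * video_fps)
    num_frames = max(min_frames, min(max_frames, target))
    # Spread frames evenly over the clip; each frame keeps its absolute time.
    step = duration_s / num_frames
    return [i * step for i in range(num_frames)]


print(sample_timestamps(5.0))
print(len(sample_timestamps(600.0)))
```

Under these assumptions, a clip longer than `max_frames / video_fps` seconds is sampled more sparsely, and the timestamps are what let the model tell a 2-second gap from a 20-second gap between adjacent frames.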

## 🧬 Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning.

<p align="center">
    <img src="assets/3d-rope.png" alt="MOSS-VL XRoPE Architecture Illustration" width="80%"/>
</p>
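This card does not specify how coordinates are assigned, so the following is only one plausible layout, not the released implementation. It assumes each visual patch takes its (t, h, w) coordinate from its frame index and grid position, and that a text token reuses a single index on all three axes. Both function names are hypothetical.

```python
from itertools import product


def visual_positions(num_frames, grid_h, grid_w):
    """(t, h, w) coordinate for every patch, frame by frame, row by row."""
    return list(product(range(num_frames), range(grid_h), range(grid_w)))


def text_positions(start, length):
    """Assumed convention: a text token uses the same index on all three axes."""
    return [(p, p, p) for p in range(start, start + length)]


vis = visual_positions(num_frames=2, grid_h=2, grid_w=3)
txt = text_positions(start=0, length=4)
print(len(vis), vis[:4])
print(txt)
```

Placing both modalities in the same three-axis space is what allows a rotary embedding to encode relative offsets in time, height, and width with one mechanism.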

## 📊 Model Performance

This release focuses on providing a strong pretrained foundation rather than a fully instruction-tuned evaluation snapshot. Public benchmark tables and detailed metrics for the base checkpoint will be released in future updates.

## 🚀 Quickstart

### 🛠️ Requirements

Create a conda environment and install the dependencies listed in `requirements.txt`:

```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
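The inference examples in this card request `attn_implementation="flash_attention_2"`, which in Transformers depends on the separately installed `flash_attn` package. The sketch below picks a backend at load time. The helper `pick_attn_implementation` is hypothetical, and falling back to `"sdpa"` is an assumption: this card does not state whether the checkpoint's remote code supports it.

```python
import importlib.util


def pick_attn_implementation(preferred="flash_attention_2", fallback="sdpa"):
    """Return `preferred` if the flash_attn package is importable, else `fallback`."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    return preferred if has_flash_attn else fallback


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can be passed as the `attn_implementation` argument of `from_pretrained`.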

### 🏃 Run Inference

<details>
<summary><strong>Single-image offline inference (Python)</strong></summary>

<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)

print(text)
```

</details>
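For a sense of scale, the sketch below estimates how many merged visual cells an image produces with `patch_size=16` and `merge_size=2`. It assumes that `merge_size` groups `merge_size × merge_size` neighbouring patches into one cell and that the image is padded up to a whole number of cells; neither is confirmed by this card, and `merged_grid` is a hypothetical helper.

```python
import math


def merged_grid(height, width, patch_size=16, merge_size=2):
    """Count merged cells, rounding each side up to a whole number of cells."""
    cell = patch_size * merge_size  # 32 pixels per side with the values above
    rows = math.ceil(height / cell)
    cols = math.ceil(width / cell)
    return rows, cols, rows * cols


rows, cols, n = merged_grid(720, 1280)
print(rows, cols, n)
```

Under these assumptions a 1280×720 frame yields a 23×40 grid, which gives a quick way to reason about how resolution and frame count drive the visual sequence length.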

<details>
<summary><strong>Single-video offline inference (Python)</strong></summary>

<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)

print(text)
```

</details>

<details>
<summary><strong>Batched offline inference (Python)</strong></summary>

<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}
shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)
queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

# Collect the generated text of each result and print it.
texts = [item["text"] for item in result["results"]]
for text in texts:
    print(text)
```

</details>

## 🎯 Intended Use

- offline image understanding
- offline video understanding
- multimodal prompt experiments for release validation
- checkpoint-level inference integration and debugging

## 🚧 Limitations and Future Work

MOSS-VL-Base-0408 is a pretrained base checkpoint intended primarily as a foundation model, and several release items are still being finalized:

- real-time usage is not documented here
- benchmark results, metrics, and training details have not been published yet
- some sections of this card are placeholders until release information is finalized
- batch calls currently require shared `generate_kwargs` and shared `media_kwargs` within one call
- batch streaming and batch cancel / stop protocol are not part of `offline_batch_generate(...)`
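
One conservative way to work within the shared-kwargs constraint is to group queries by their kwargs and issue one batch call per group. The sketch below shows only the grouping step. `group_by_kwargs` is a hypothetical helper, and it assumes all kwargs values are JSON-serializable.

```python
import json
from collections import defaultdict


def group_by_kwargs(queries):
    """Group queries whose generate_kwargs and media_kwargs are identical."""
    groups = defaultdict(list)
    for query in queries:
        # Serialize with sorted keys so key order does not split a group.
        key = json.dumps(
            [query.get("generate_kwargs", {}), query.get("media_kwargs", {})],
            sort_keys=True,
        )
        groups[key].append(query)
    return list(groups.values())


queries = [
    {"images": ["a.jpg"], "generate_kwargs": {"max_new_tokens": 64}},
    {"images": ["b.jpg"], "generate_kwargs": {"max_new_tokens": 256}},
    {"images": ["c.jpg"], "generate_kwargs": {"max_new_tokens": 64}},
]
for batch in group_by_kwargs(queries):
    # Each batch could then be passed to model.offline_batch_generate(...).
    print([q["images"][0] for q in batch])
```

Grouping this way is stricter than strictly necessary if mixed image and video queries are allowed to differ in `media_kwargs`, but it never sends conflicting settings in one call.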

> [!NOTE]
> We expect future releases to expand public evaluation coverage and provide stronger downstream aligned variants built on top of this base checkpoint.

## 📜 Citation

```bibtex
@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note          = {GitHub repository}
}
```