---
title: MOSS-VL-Base-0408
date: 2026-04-08
category: Multimodal-LLM
status: Base
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
- Base
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---
# MOSS-VL-Base-0408
## 📌 Introduction
MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of the OpenMOSS ecosystem dedicated to advancing open multimodal foundation models.
Built exclusively through four stages of multimodal pretraining, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose vision-language representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation. The four pretraining stages are:
1. Stage 1: Vision-language alignment
2. Stage 2: Large-scale multimodal pretraining
3. Stage 3: High-quality multimodal pretraining
4. Stage 4: Annealing and long-context extension
### ✨ Highlights
- 🎬 **Native Video Understanding Foundation** — Supports native video frame sequence processing for temporal perception and general multimodal representation learning.
- 🖼️ **Strong General Multimodal Perception** — Covers single-image, multi-image, and mixed-modality offline understanding workloads.
- 🧱 **Robust Base for Adaptation** — Serves as the pretrained backbone for future SFT, alignment, and task-specific adaptation.
### 📝 Note on Variants
> [!IMPORTANT]
> **This is the base checkpoint.** It has **NOT** undergone supervised fine-tuning (SFT) or instruction tuning, and it is not the streaming variant. If you are looking for a user-facing instruction-following model, please refer to the corresponding instruct release.
---
## 🏗 Model Architecture
**MOSS-VL-Base-0408** adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. Natively supporting interleaved modalities, it provides a flexible multimodal backbone for image and video understanding while preserving a clean foundation for downstream alignment and adaptation.
## 🧩 Absolute Timestamps
To help the model perceive the pacing and duration of events, **MOSS-VL-Base-0408** injects absolute timestamps alongside sampled video frames, giving the reasoning process an explicit temporal reference even at the pretrained base stage.
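As an illustration of what pairing sampled frames with absolute timestamps can look like, the sketch below samples a clip at a fixed rate and returns each frame's time in seconds. The helper name `sample_timestamps`, the uniform sampling rule, and the clamping to a frame budget are assumptions for illustration; this card does not specify how the MOSS-VL processor samples frames or formats timestamps.

```python
def sample_timestamps(duration_s: float, fps: float = 1.0,
                      min_frames: int = 1, max_frames: int = 256) -> list[float]:
    """Return absolute timestamps (seconds) for uniformly sampled frames.

    Hypothetical helper: the real processor may sample differently.
    """
    if duration_s < 0 or fps <= 0:
        raise ValueError("duration_s must be >= 0 and fps must be > 0")
    # Target one frame every 1/fps seconds, clamped to the frame budget.
    n = int(duration_s * fps)
    n = max(min_frames, min(max_frames, n))
    # Spread the n frames evenly over the clip, starting at t = 0.
    step = duration_s / n
    return [round(i * step, 3) for i in range(n)]


# A 5-second clip at 1 fps yields one timestamp per second.
print(sample_timestamps(5.0))
# A 10-minute clip hits max_frames, so frames are spaced further apart.
print(len(sample_timestamps(600.0)))
```

Under these assumptions, a long clip keeps the same number of frames but with wider gaps between them, which is the case where an explicit timestamp per frame carries the most information.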
## 🧬 Cross-attention RoPE (XRoPE)
MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and visual features into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w), improving spatial-temporal grounding during multimodal reasoning.
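To make the unified (t, h, w) coordinate space more concrete, the sketch below assigns a 3D index to every visual patch of a short frame sequence and to every text token. The layout is an assumption for illustration only: the helper names are hypothetical, and placing text tokens on the diagonal t = h = w is one possible convention, not something taken from the MOSS-VL implementation.

```python
from itertools import product


def visual_positions(num_frames: int, grid_h: int, grid_w: int) -> list[tuple[int, int, int]]:
    """(t, h, w) index for every patch, frame by frame, in row-major order."""
    return list(product(range(num_frames), range(grid_h), range(grid_w)))


def text_positions(num_tokens: int, start: int = 0) -> list[tuple[int, int, int]]:
    """Hypothetical text layout: a 1D position copied onto all three axes."""
    return [(start + i, start + i, start + i) for i in range(num_tokens)]


# Two frames of 2x3 patches give 12 visual positions.
vis = visual_positions(num_frames=2, grid_h=2, grid_w=3)
print(len(vis), vis[0], vis[-1])

# Four text tokens, each with identical t, h, and w components.
print(text_positions(num_tokens=4))
```

Once both modalities carry coordinates in the same space, a rotary embedding can be applied per axis, so that relative offsets in time, height, and width are each reflected in the attention scores.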
## 📊 Model Performance
This release focuses on providing a strong pretrained foundation rather than a fully instruction-tuned evaluation snapshot. Public benchmark tables and detailed metrics for the base checkpoint will be released in future updates.
## 🚀 Quickstart
### 🛠️ Requirements
Installation commands:
```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
### 🏃 Run Inference
Single-image offline inference (Python)
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
image_path = "data/example_image.jpg"


def load_model(checkpoint: str):
    # trust_remote_code loads the custom MOSS-VL classes shipped with the checkpoint.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # requires the flash-attn package
    )
    return model, processor


model, processor = load_model(checkpoint)

# Single-image generation with an empty prompt and no chat template.
text = model.offline_image_generate(
    processor,
    prompt="",
    image=image_path,
    shortest_edge=4096,
    longest_edge=16777216,
    multi_image_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
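The numeric arguments in this call interact. The sketch below estimates how many visual tokens one image might produce, assuming that `shortest_edge` and `longest_edge` act as lower and upper bounds on total pixel count (the batched example passes the same values as `min_pixels` and `max_pixels`) and that `merge_size` groups `merge_size` x `merge_size` patches into one token. Both assumptions and the helper name `estimate_visual_tokens` are illustrative; the processor's actual resizing logic may differ.

```python
import math


def estimate_visual_tokens(height: int, width: int, patch_size: int = 16,
                           merge_size: int = 2, min_pixels: int = 4096,
                           max_pixels: int = 16777216) -> int:
    """Rough visual-token count for one image (hypothetical helper)."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Scale the image so its pixel count falls inside [min_pixels, max_pixels].
    pixels = height * width
    scale = 1.0
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
    elif pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)
    # Round each side to a multiple of patch_size * merge_size.
    unit = patch_size * merge_size
    h = max(unit, round(height * scale / unit) * unit)
    w = max(unit, round(width * scale / unit) * unit)
    # One token per merge_size x merge_size group of patches.
    return (h // unit) * (w // unit)


# Estimate for a 1080p frame with the default settings from the call above.
print(estimate_visual_tokens(1080, 1920))
```

An estimate like this can help when choosing `max_new_tokens` and batch sizes, since the visual token count grows with image area.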
Single-video offline inference (Python)
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"


def load_model(checkpoint: str):
    # trust_remote_code loads the custom MOSS-VL classes shipped with the checkpoint.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # requires the flash-attn package
    )
    return model, processor


model, processor = load_model(checkpoint)

# Single-video generation with an empty prompt and no chat template.
text = model.offline_video_generate(
    processor,
    prompt="",
    video=video_path,
    shortest_edge=4096,
    longest_edge=16777216,
    video_max_pixels=201326592,
    patch_size=16,
    temporal_patch_size=1,
    merge_size=2,
    video_fps=1.0,
    min_frames=1,
    max_frames=256,
    num_extract_threads=4,
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    max_new_tokens=256,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.0,
    do_sample=False,
    vision_chunked_length=64,
    use_template=False,
)
print(text)
```
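One possible reading of `video_max_pixels` is a pixel budget shared by all sampled frames of a clip. The sketch below shows how such a budget would translate into a per-frame resolution cap. This interpretation and the helper name `per_frame_pixel_budget` are assumptions for illustration; the card does not document how the processor applies the value.

```python
def per_frame_pixel_budget(num_frames: int, video_max_pixels: int = 201326592,
                           max_pixels: int = 16777216, min_pixels: int = 4096) -> int:
    """Pixels available per frame if video_max_pixels is a whole-clip budget (assumption)."""
    if num_frames < 1:
        raise ValueError("num_frames must be >= 1")
    # Split the clip budget evenly, then clamp to the per-image bounds.
    share = video_max_pixels // num_frames
    return max(min_pixels, min(max_pixels, share))


# With the values from the call above, short clips keep full resolution
# while a clip sampled at max_frames=256 gets a smaller per-frame budget.
for n in (1, 12, 256):
    print(n, per_frame_pixel_budget(n))
```

Under this reading, raising `max_frames` trades spatial detail per frame for temporal coverage once the clip budget is exhausted.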
Batched offline inference (Python)
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

# Generation settings shared by every query in the batch.
shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}

# Media preprocessing settings for video queries.
shared_video_media_kwargs = {
    "min_pixels": 4096,
    "max_pixels": 16777216,
    "video_max_pixels": 201326592,
    "video_fps": 1.0,
    "min_frames": 1,
    "max_frames": 256,
}


def load_model(checkpoint: str):
    # trust_remote_code loads the custom MOSS-VL classes shipped with the checkpoint.
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # requires the flash-attn package
    )
    return model, processor


model, processor = load_model(checkpoint)

# One image query and one video query; each gets its own copy of the shared kwargs.
queries = [
    {
        "images": ["data/sample_a.jpg"],
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_video_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

# Collect the generated text for each query.
texts = [item["text"] for item in result["results"]]
```
## 🎯 Intended Use
- offline image understanding
- offline video understanding
- multimodal prompt experiments for release validation
- checkpoint-level inference integration and debugging
## 🚧 Limitations and Future Work
MOSS-VL-Base-0408 is a pretrained base checkpoint intended primarily as a foundation model, and several release items are still being finalized:
- real-time (streaming) usage is not documented here
- benchmark results, metrics, and training details have not yet been published
- some sections of this card are intentional placeholders until release information is finalized
- batch calls currently require shared `generate_kwargs` and shared `media_kwargs` within one call
- batch streaming and batch cancel / stop protocol are not part of `offline_batch_generate(...)`
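Given the shared-kwargs requirement above, a small pre-flight check can catch mismatched queries before calling `offline_batch_generate(...)`. The helper `check_shared_kwargs` below is hypothetical and not part of the model's API. It also assumes that a query omitting a key is compatible with the others, which this card does not confirm.

```python
def check_shared_kwargs(queries: list[dict],
                        keys: tuple[str, ...] = ("generate_kwargs", "media_kwargs")) -> None:
    """Raise ValueError if queries in one batch disagree on a shared kwargs dict."""
    for key in keys:
        # Collect the values from queries that actually set this key.
        present = [q[key] for q in queries if key in q]
        for other in present[1:]:
            if other != present[0]:
                raise ValueError(f"all queries in a batch must share the same {key}")


gen = {"max_new_tokens": 256, "do_sample": False}
queries = [
    {"images": ["data/sample_a.jpg"], "generate_kwargs": dict(gen)},
    {"videos": ["data/sample_b.mp4"], "generate_kwargs": dict(gen)},
]
check_shared_kwargs(queries)  # passes silently: both queries agree
```

Queries that need different settings can be grouped by their kwargs and submitted as separate batch calls.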
> [!NOTE]
> We expect future releases to expand public evaluation coverage and provide stronger downstream aligned variants built on top of this base checkpoint.
## 📜 Citation
```bibtex
@misc{moss_vl_2026,
title = {{MOSS-VL Technical Report}},
author = {OpenMOSS Team},
year = {2026},
howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
note = {GitHub repository}
}
```