---
title: MOSS-VL-SFT-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

<p align="center">
   <img src="assets/logo.png" width="320"/>
</p>

# MOSS-VL-SFT-0408

## 📌 Introduction

We introduce **MOSS-VL-SFT-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).

> [!IMPORTANT]
> This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.

This model is designed as an offline (non-streaming) inference engine for multimodal tasks, bridging the gap between static image understanding and dynamic real-time interaction.

### This checkpoint is intended for:

-   **Video/image understanding** with significantly improved instruction-following capabilities.
-   Serving as a **strong starting point** for further **Real-Time SFT** or specific domain adaptation.

---

## 🚀 Key Features & Status

| Feature | Status | Description |
| :--- | :---: | :--- |
| **Model Loading** | ✅ | Standard HF loading with `trust_remote_code=True` |
| **Image Understanding** | ✅ | Single/Multi-image input support |
| **Video Understanding** | ✅ | Native video frame sequence processing |
| **Mixed Inference** | ✅ | Interleaved image and video inputs |
| **Offline Generation** | ✅ | Optimized `offline_generate` & `offline_batch_generate` |
| **Benchmarks/Metrics** | ⏳ | Coming in future updates |

---

## 🏗 Model Architecture

**MOSS-VL-SFT-0408** adopts a decoupled multimodal design: a cross-attention mechanism connects high-resolution visual encoding to the language model's reasoning.

<p align="center">
    <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
    <br>
    <em>Figure 1: MOSS-VL Core Architecture.</em>
</p>


## Temporal-Aware Prompting

At the model-family level, MOSS-VL uses timestamp-aware multimodal prompting for video understanding. This design gives sampled frames explicit temporal anchors, which helps the model reason about order, duration, and event localization more robustly.

<p align="center">
    <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
    <br>
    <em>Figure 2: Illustration of the timestamped sequence input pipeline.</em>
</p>
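
To make the idea of temporal anchors concrete, here is a minimal sketch of how sampled frames could be paired with timestamps. It assumes that `video_fps`, `video_minlen`, and `video_maxlen` from the Quickstart mean "target sampling rate" and "lower/upper bound on the frame count"; the helper names, the midpoint sampling rule, and the `[0.5s]<frame>` text format are hypothetical and are not the actual MOSS-VL prompt template.

```python
def sample_timestamps(duration_s, fps=1.0, min_frames=8, max_frames=256):
    """Return one timestamp (in seconds) per sampled frame.

    Assumption: the frame count is duration * fps, clamped to
    [min_frames, max_frames]; each frame sits at the midpoint of one of
    `count` equal-length segments of the video.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    count = max(min_frames, min(max_frames, int(duration_s * fps)))
    segment = duration_s / count
    return [round((i + 0.5) * segment, 3) for i in range(count)]


def interleave_with_timestamps(timestamps, frame_token="<frame>"):
    """Prefix each frame placeholder with a textual time anchor.

    The "[t s]<frame>" layout is a hypothetical illustration only.
    """
    return "".join(f"[{t:.1f}s]{frame_token}" for t in timestamps)


if __name__ == "__main__":
    stamps = sample_timestamps(4.0, fps=1.0, min_frames=2)
    print(interleave_with_timestamps(stamps))
```

Under these assumptions, a short clip is padded up to the minimum frame count and a long one is capped at the maximum, so the spacing between anchors changes with video length while every frame still carries an explicit time.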

## Multimodal RoPE

MOSS-VL uses multimodal rotary position encoding to align text tokens and visual features in a shared spatial-temporal coordinate system. At a high level, this improves video-text grounding and helps preserve temporal structure during multimodal reasoning.

<p align="center">
    <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
    <br>
    <em>Figure 3: 3D-RoPE spatial-temporal alignment.</em>
</p>
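
As a rough illustration of a shared spatial-temporal coordinate system, the sketch below assigns a `(t, h, w)` position triple to every token, in the style of published M-RoPE-like schemes. This is an assumption for illustration: the exact position layout used by MOSS-VL is not specified in this card, and `build_position_ids` and its segment format are hypothetical.

```python
def build_position_ids(segments):
    """Assign a (t, h, w) position triple to each token.

    segments: list of ("text", n_tokens) or ("vision", (T, H, W)) entries.
    Text tokens advance all three axes together; vision patches vary each
    axis independently, starting from the current offset.
    """
    ids = []
    next_pos = 0
    for kind, spec in segments:
        if kind == "text":
            for _ in range(spec):
                ids.append((next_pos, next_pos, next_pos))
                next_pos += 1
        elif kind == "vision":
            frames, height, width = spec
            for t in range(frames):
                for h in range(height):
                    for w in range(width):
                        ids.append((next_pos + t, next_pos + h, next_pos + w))
            # Later tokens start after the largest extent used by this block.
            next_pos += max(frames, height, width)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return ids


if __name__ == "__main__":
    layout = [("text", 2), ("vision", (1, 2, 2)), ("text", 1)]
    for triple in build_position_ids(layout):
        print(triple)
```

Each axis of the triple would then drive its own slice of the rotary frequencies, which is how one encoding can carry both text order and the frame/row/column position of a patch.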


## 🚀 Quickstart

<details>
<summary><strong>Queue-based offline inference (Python)</strong></summary>

<br>

```python
import os
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

max_new_tokens = 1024
temperature = 1.0
top_k = 50
top_p = 1.0
repetition_penalty = 1.0

video_fps = 1.0
video_minlen = 8
video_maxlen = 256


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


if not checkpoint:
    raise ValueError("Missing `checkpoint`.")
if not video_path:
    raise ValueError("Missing `video_path`.")
if not os.path.isfile(video_path):
    raise FileNotFoundError(f"Video not found: {video_path}")

model, processor = load_model(checkpoint)
new_queries: "queue.Queue[dict]" = queue.Queue()
output_text_queue: "queue.Queue[str]" = queue.Queue()

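# One request for the worker: prompt, media inputs, and per-request options.
query = {
    "prompt": prompt,
    "images": [],
    "videos": [video_path],
    "media_kwargs": {
        "video_fps": video_fps,
        "video_minlen": video_minlen,
        "video_maxlen": video_maxlen,
    },
    "generate_kwargs": {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
        # With `do_sample=False`, standard HF generation decodes greedily and
        # ignores temperature/top_k/top_p; set it to True to sample.
        "do_sample": False,
    },
}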


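def drain_output():
    # Print streamed text pieces until the end-of-round marker arrives.
    while True:
        tok = output_text_queue.get()  # blocks until the worker emits text
        if tok == "<|round_end|>":
            break
        print(tok, end="", flush=True)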


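# Run `offline_generate` in a background thread: it reads requests from
# `new_queries` and writes generated text pieces to `output_text_queue`.
worker = threading.Thread(
    target=model.offline_generate,
    args=(processor, new_queries, output_text_queue),
    kwargs={"vision_chunked_length": 64},
    daemon=True,
)
worker.start()

new_queries.put(query)
drain_output()

# Ask the worker loop to stop, then wait briefly for the thread to exit.
new_queries.put({"stop_offline_generate": True})
worker.join(timeout=5.0)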
```

For image-only usage, keep the same template and change:

- replace `video_path` with `image_path`
- validate `image_path` instead of `video_path`
- set `images` to `[image_path]`
- set `videos` to `[]`
- remove `media_kwargs` if you do not need video-specific controls
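
The changes listed above can be sketched as a small helper that builds the image-only query dict. The helper name `build_image_query` is hypothetical; the dict keys are the ones used in the video example.

```python
import os


def build_image_query(image_path, prompt, generate_kwargs):
    """Build an image-only query with the same keys as the video example."""
    if not image_path:
        raise ValueError("Missing `image_path`.")
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"Image not found: {image_path}")
    return {
        "prompt": prompt,
        "images": [image_path],  # the image replaces the video input
        "videos": [],            # no video in this request
        # `media_kwargs` is omitted: it only carries video-specific controls.
        "generate_kwargs": dict(generate_kwargs),
    }
```

The returned dict can be passed to `new_queries.put(...)` in place of `query`.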

</details>

<details>
<summary><strong>Batched offline inference (Python)</strong></summary>

<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}

shared_media_kwargs = {
    "video_fps": 1.0,
    "video_minlen": 8,
    "video_maxlen": 256,
}


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)
queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": dict(shared_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
session_states = result["session_states"]
```

Follow-up turns reuse the `session_states` returned by the first call; set `reset_session` on a query to start that sample over.

```python
followup_queries = [
    {
        "prompt": "Summarize sample A in one sentence.",
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "prompt": "Restart sample B and answer again.",
        "reset_session": True,
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

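with torch.no_grad():
    # Reuse the session states returned by the first call.
    followup_result = model.offline_batch_generate(
        processor,
        followup_queries,
        session_states=session_states,
        vision_chunked_length=64,
    )

followup_texts = [item["text"] for item in followup_result["results"]]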
```
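
Because a single `offline_batch_generate(...)` call expects shared `generate_kwargs` and `media_kwargs` (see Limitations), queries with different settings need to be split into separate calls. The sketch below shows one way to bucket them first; `group_by_shared_kwargs` is a hypothetical helper, not part of the model's API, and it assumes the kwargs are JSON-serializable.

```python
import json
from collections import defaultdict


def group_by_shared_kwargs(queries):
    """Bucket queries so each bucket shares generate_kwargs and media_kwargs.

    Returns a list of buckets; each bucket is a list of (original_index, query)
    pairs so results can be mapped back to the input order afterwards.
    """
    buckets = defaultdict(list)
    for index, query in enumerate(queries):
        key = json.dumps(
            {
                "generate_kwargs": query.get("generate_kwargs", {}),
                "media_kwargs": query.get("media_kwargs", {}),
            },
            sort_keys=True,
        )
        buckets[key].append((index, query))
    return list(buckets.values())
```

Each bucket can then be sent as its own batch call, and the stored indices let you restore the original order of the outputs.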

</details>

## Intended Use

- offline image understanding
- offline video understanding
- multimodal prompt experiments for release validation
- checkpoint-level inference integration and debugging

## Requirements

Core validated inference dependencies:

- `python==3.12.13`
- `torch==2.8.0+cu128`
- `torchvision==0.23.0+cu128`
- `transformers==4.57.1`
- `accelerate==1.12.0`
- `flash_attn==2.8.1`
- `torchcodec==0.7.0`
- `numpy==2.4.3`
- `pillow==12.1.1`
- `joblib==1.5.2`
- `einops==0.8.2`

Installation commands:

```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
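
After installation, a quick check against the validated versions can catch mismatches early. The following is a minimal sketch using only the standard library; the `PINS` table copies the versions listed above, `check_pins` is a hypothetical helper, and it assumes local build suffixes such as `+cu128` should be ignored when comparing.

```python
from importlib import metadata

# Versions copied from the validated dependency list above.
PINS = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "transformers": "4.57.1",
    "accelerate": "1.12.0",
    "flash_attn": "2.8.1",
    "torchcodec": "0.7.0",
    "numpy": "2.4.3",
    "pillow": "12.1.1",
    "joblib": "1.5.2",
    "einops": "0.8.2",
}


def check_pins(pins):
    """Map each package to (status, installed_version)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = ("missing", None)
            continue
        # Drop a local build suffix such as "+cu128" before comparing.
        public = installed.split("+", 1)[0]
        report[name] = ("ok" if public == wanted else "mismatch", installed)
    return report


if __name__ == "__main__":
    for package, (status, installed) in check_pins(PINS).items():
        print(f"{package:12s} {status:9s} {installed}")
```

A `mismatch` does not necessarily mean inference will fail; it only flags that the environment differs from the validated one.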

Validated setup notes:

- CUDA runtime used for validation: `12.8`
- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`


## Limitations and Future Work

- real-time usage is not documented here
- benchmark results, metrics, and training details have not been published yet
- some sections are intentionally placeholders until release information is finalized
- batch calls currently require shared `generate_kwargs` and shared `media_kwargs` within one call
- batch streaming and batch cancel / stop protocol are not part of `offline_batch_generate(...)`
- the queue example is intentionally minimal and does not include production-grade timeout or worker error handling
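
For the last point, one possible hardening of the queue example is a drain loop that gives up when no text arrives for a while or when the worker thread has exited. This is a sketch under assumptions: `drain_with_timeout`, `idle_timeout_s`, and `poll_s` are hypothetical names, and only the `<|round_end|>` marker comes from the Quickstart.

```python
import queue
import time

ROUND_END = "<|round_end|>"  # end-of-round marker used in the Quickstart


def drain_with_timeout(out_q, worker=None, idle_timeout_s=30.0, poll_s=0.5):
    """Collect text pieces until ROUND_END, with a bound on idle waiting.

    Raises RuntimeError if `worker` (a threading.Thread) has exited early,
    and TimeoutError if no piece arrives within `idle_timeout_s` seconds.
    """
    pieces = []
    last_piece_at = time.monotonic()
    while True:
        try:
            tok = out_q.get(timeout=poll_s)
        except queue.Empty:
            if worker is not None and not worker.is_alive():
                raise RuntimeError("worker exited before the end marker")
            if time.monotonic() - last_piece_at > idle_timeout_s:
                raise TimeoutError("no output within the idle timeout")
            continue
        if tok == ROUND_END:
            return "".join(pieces)
        pieces.append(tok)
        last_piece_at = time.monotonic()
```

In the Quickstart, this would replace `drain_output()` with `text = drain_with_timeout(output_text_queue, worker)`; the timeout values should be tuned to the expected video length and hardware.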


## Citation

```bibtex
@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note          = {GitHub repository}
}
```