---
title: MOSS-VL-SFT-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

<p align="center">
   <img src="assets/logo.png" width="320"/>
</p>

# MOSS-VL-SFT-0408

## 📌 Introduction

We introduce **MOSS-VL-SFT-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).

> [!IMPORTANT]
> This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.

This model is designed as a high-performance offline engine for multimodal tasks, bridging the gap between static image understanding and dynamic real-time interaction.

### This checkpoint is intended for:

-   **Video/image understanding** with significantly improved instruction-following capabilities.
-   Serving as a **strong starting point** for further **Real-Time SFT** or specific domain adaptation.

---

## 🚀 Key Features & Status

| Feature | Status | Description |
| :--- | :---: | :--- |
| **Model Loading** | ✅ | Standard HF loading with `trust_remote_code=True` |
| **Image Understanding** | ✅ | Single/Multi-image input support |
| **Video Understanding** | ✅ | Native video frame sequence processing |
| **Mixed Inference** | ✅ | Interleaved image and video inputs |
| **Offline Generation** | ✅ | Optimized `offline_generate` & `offline_batch_generate` |
| **Benchmarks/Metrics** | ⏳ | Coming in future updates |

---

## 🏗 Model Architecture

**MOSS-VL-SFT-0408** adopts a decoupled multimodal design: a cross-attention mechanism bridges high-resolution visual encoding and language reasoning.
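
As a rough illustration of what cross-attention does in such a design, the sketch below lets each text-side query vector attend over a set of visual feature vectors. It is a single-head, pure-Python toy without learned projections; the function name `cross_attention` and the toy shapes are assumptions for illustration, not the MOSS-VL implementation.

```python
import math


def cross_attention(text_queries, visual_keys, visual_values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    text_queries:  list of query vectors, one per text position
    visual_keys:   list of key vectors, one per visual token
    visual_values: list of value vectors, one per visual token
    """
    dim = len(visual_keys[0])
    out_dim = len(visual_values[0])
    outputs = []
    for q in text_queries:
        # Similarity of this text position to every visual token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in visual_keys]
        # Numerically stable softmax over the visual tokens.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The weighted sum of visual values is this position's visual context.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, visual_values))
            for i in range(out_dim)
        ])
    return outputs


# Two text positions attending over three visual tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
context = cross_attention(queries, keys, values)
print(context)
```

Each output row is a mixture of the visual values, weighted toward the visual tokens whose keys best match the text query.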

<p align="center">
    <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
    <br>
    <em>Figure 1: MOSS-VL Core Architecture.</em>
</p>


## Temporal-Aware Prompting

At the model-family level, MOSS-VL uses timestamp-aware multimodal prompting for video understanding. This design gives sampled frames explicit temporal anchors, which helps the model reason about order, duration, and event localization more robustly.
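
To make the idea of explicit temporal anchors concrete, here is a minimal sketch of sampling frame times and pairing each frame with its timestamp. The parameter names `video_fps`, `video_minlen`, and `video_maxlen` appear in the Quickstart, but their exact semantics are not documented here: the clamping rule, the helper names, and the `<t=...s>` / `<frame>` markers are all assumptions, and the real processor may behave differently.

```python
def sample_timestamps(duration_s, video_fps=1.0, video_minlen=8, video_maxlen=256):
    """Return evenly spaced sample times (seconds) for a clip.

    Assumption: the frame count is duration * fps, clamped to
    [video_minlen, video_maxlen]. The real processor may differ.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    wanted = int(round(duration_s * video_fps))
    num_frames = max(video_minlen, min(video_maxlen, wanted))
    # Sample at the centre of each of num_frames equal-length intervals.
    step = duration_s / num_frames
    return [round((i + 0.5) * step, 3) for i in range(num_frames)]


def interleave_with_timestamps(timestamps, frame_token="<frame>"):
    """Build a prompt fragment pairing each frame with its time anchor.

    The "<t=...s>" and "<frame>" markers are hypothetical placeholders.
    """
    return "".join(f"<t={t:.2f}s>{frame_token}" for t in timestamps)


times = sample_timestamps(duration_s=4.0)  # short clip: clamped up to 8 frames
print(times)
print(interleave_with_timestamps(times[:3]))
```

Because every frame carries its own time marker, questions about order or duration can be answered from the prompt itself rather than inferred from frame position alone.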

<p align="center">
    <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
    <br>
    <em>Figure 2: Illustration of the timestamped sequence input pipeline.</em>
</p>

## Multimodal RoPE

MOSS-VL uses multimodal rotary position encoding to align text tokens and visual features in a shared spatial-temporal coordinate system. At a high level, this improves video-text grounding and helps preserve temporal structure during multimodal reasoning.
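
The sketch below shows one common way to lay out such coordinates, following the M-RoPE scheme described for Qwen2-VL: text tokens advance all three axes together, while visual tokens share an offset and vary by frame, row, and column. Whether MOSS-VL assigns positions exactly this way is an assumption; the function `build_position_ids` and the segment format are hypothetical.

```python
def build_position_ids(segments):
    """Assign a (t, h, w) position triple to every token.

    segments: list of ("text", n_tokens) or ("vision", frames, height, width)
    Text tokens advance all three axes together; vision tokens share a
    common offset and vary by frame index, row, and column.
    """
    positions = []
    next_pos = 0
    for segment in segments:
        if segment[0] == "text":
            _, n_tokens = segment
            for i in range(n_tokens):
                p = next_pos + i
                positions.append((p, p, p))
            next_pos += n_tokens
        elif segment[0] == "vision":
            _, frames, height, width = segment
            for t in range(frames):
                for h in range(height):
                    for w in range(width):
                        positions.append((next_pos + t, next_pos + h, next_pos + w))
            # Continue after the largest index used on any axis.
            next_pos += max(frames, height, width)
        else:
            raise ValueError(f"unknown segment type: {segment[0]!r}")
    return positions


# 2 text tokens, then a 2-frame clip of 2x2 patches, then 1 text token.
ids = build_position_ids([("text", 2), ("vision", 2, 2, 2), ("text", 1)])
for triple in ids:
    print(triple)
```

Each axis of the triple would then drive its own slice of the rotary embedding, so patches in the same frame share a temporal position while still differing spatially.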

<p align="center">
    <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
    <br>
    <em>Figure 3: 3D-RoPE spatial-temporal alignment.</em>
</p>




## 🚀 Quickstart

<details>
<summary><strong>Queue-based offline inference (Python)</strong></summary>

<br>

```python
import os
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

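# Replace with your checkpoint location and a real video file.
checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

# Generation settings. The query below sets `do_sample=False`; in standard
# Hugging Face generation that means greedy decoding, where temperature,
# top_k, and top_p have no effect.
max_new_tokens = 1024
temperature = 1.0
top_k = 50
top_p = 1.0
repetition_penalty = 1.0

# Video sampling controls, passed to the model through `media_kwargs`.
video_fps = 1.0
video_minlen = 8
video_maxlen = 256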


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


if not checkpoint:
    raise ValueError("Missing `checkpoint`.")
if not video_path:
    raise ValueError("Missing `video_path`.")
if not os.path.isfile(video_path):
    raise FileNotFoundError(f"Video not found: {video_path}")

model, processor = load_model(checkpoint)
new_queries: "queue.Queue[dict]" = queue.Queue()
output_text_queue: "queue.Queue[str]" = queue.Queue()

query = {
    "prompt": prompt,
    "images": [],
    "videos": [video_path],
    "media_kwargs": {
        "video_fps": video_fps,
        "video_minlen": video_minlen,
        "video_maxlen": video_maxlen,
    },
    "generate_kwargs": {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
        "do_sample": False,
    },
}


def drain_output():
    while True:
        tok = output_text_queue.get()
        if tok == "<|round_end|>":
            break
        print(tok, end="", flush=True)


worker = threading.Thread(
    target=model.offline_generate,
    args=(processor, new_queries, output_text_queue),
    kwargs={"vision_chunked_length": 64},
    daemon=True,
)
worker.start()

new_queries.put(query)
drain_output()

new_queries.put({"stop_offline_generate": True})
worker.join(timeout=5.0)
```
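
The `drain_output` loop above waits forever if the worker stops before emitting `<|round_end|>`. As a hedged sketch of one way to guard against that, the variant below polls the queue with a timeout and checks whether the worker thread is still alive. The names `drain_output_with_timeout`, `idle_timeout`, and `poll` are hypothetical, and a stand-in worker replaces the model so the block runs on its own.

```python
import queue
import threading
import time

ROUND_END = "<|round_end|>"  # end-of-round sentinel used in the example above


def drain_output_with_timeout(out_queue, worker, idle_timeout=60.0, poll=0.5):
    """Collect streamed text until ROUND_END, failing fast on a stalled worker.

    Raises RuntimeError if the worker thread exits early, and TimeoutError
    if no token arrives within `idle_timeout` seconds.
    """
    pieces = []
    last_token_at = time.monotonic()
    while True:
        try:
            tok = out_queue.get(timeout=poll)
        except queue.Empty:
            if not worker.is_alive() and out_queue.empty():
                raise RuntimeError("worker exited before the round ended")
            if time.monotonic() - last_token_at > idle_timeout:
                raise TimeoutError(f"no output for {idle_timeout:.0f}s")
            continue
        if tok == ROUND_END:
            return "".join(pieces)
        pieces.append(tok)
        last_token_at = time.monotonic()


# Stand-in worker so the sketch runs without a model.
def fake_worker(out_queue):
    for tok in ["A ", "cat ", "jumps.", ROUND_END]:
        out_queue.put(tok)


demo_queue = queue.Queue()
demo_thread = threading.Thread(target=fake_worker, args=(demo_queue,), daemon=True)
demo_thread.start()
text = drain_output_with_timeout(demo_queue, demo_thread, idle_timeout=5.0, poll=0.1)
print(text)
```

Unlike the original loop, this variant returns the joined text instead of printing tokens as they arrive; printing inside the loop would restore the streaming behaviour.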

For image-only usage, keep the same template and change:

- replace `video_path` with `image_path`
- validate `image_path` instead of `video_path`
- set `images` to `[image_path]`
- set `videos` to `[]`
- remove `media_kwargs` if you do not need video-specific controls
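
Applied together, the changes above might produce a query like the one built below. This is a sketch under stated assumptions: the helper `build_image_query`, its `check_exists` flag, and the path `data/example_image.jpg` are hypothetical, while the dict keys and default generation settings mirror the video template.

```python
import os

# Same generation settings as the video template.
DEFAULT_GENERATE_KWARGS = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.0,
    "do_sample": False,
}


def build_image_query(image_path, prompt, generate_kwargs=None, check_exists=True):
    """Return a query dict with one image and no videos.

    `media_kwargs` is omitted because it only carries video-specific
    controls in the template.
    """
    if not image_path:
        raise ValueError("Missing `image_path`.")
    if check_exists and not os.path.isfile(image_path):
        raise FileNotFoundError(f"Image not found: {image_path}")
    return {
        "prompt": prompt,
        "images": [image_path],
        "videos": [],
        "generate_kwargs": dict(generate_kwargs or DEFAULT_GENERATE_KWARGS),
    }


# `check_exists=False` only so the sketch runs without a real file.
query = build_image_query("data/example_image.jpg", "Describe the image.", check_exists=False)
print(query)
```

The resulting dict can be put on `new_queries` in place of the video query.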

</details>

<details>
<summary><strong>Batched offline inference (Python)</strong></summary>

<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}

shared_media_kwargs = {
    "video_fps": 1.0,
    "video_minlen": 8,
    "video_maxlen": 256,
}


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)
queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": dict(shared_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
session_states = result["session_states"]
```

```python
followup_queries = [
    {
        "prompt": "Summarize sample A in one sentence.",
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "prompt": "Restart sample B and answer again.",
        "reset_session": True,
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    followup_result = model.offline_batch_generate(
        processor,
        followup_queries,
        session_states=session_states,
        vision_chunked_length=64,
    )
```
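
The Limitations section notes that one `offline_batch_generate(...)` call currently requires shared `generate_kwargs` and shared `media_kwargs`. A small pre-flight check can catch accidental mismatches before the call. The helper `assert_shared_kwargs` is hypothetical, and how the model itself reacts to mismatched settings is not documented here.

```python
def assert_shared_kwargs(queries, keys=("generate_kwargs", "media_kwargs")):
    """Raise ValueError if queries in one batch disagree on shared settings.

    A missing key is treated as None, so all queries must either omit the
    key or provide equal dicts.
    """
    if not queries:
        raise ValueError("queries must not be empty")
    for key in keys:
        reference = queries[0].get(key)
        for index, query in enumerate(queries[1:], start=1):
            if query.get(key) != reference:
                raise ValueError(f"query {index} has a different `{key}` than query 0")


shared = {"max_new_tokens": 256, "do_sample": False}
batch = [
    {"prompt": "Describe sample A.", "generate_kwargs": dict(shared)},
    {"prompt": "Describe sample B.", "generate_kwargs": dict(shared)},
]
assert_shared_kwargs(batch)  # passes silently

# Introduce a mismatch to show the failure path.
batch[1]["generate_kwargs"]["max_new_tokens"] = 64
try:
    assert_shared_kwargs(batch)
except ValueError as err:
    print(err)
```

Copying the shared settings with `dict(...)`, as the examples above do, keeps each query independent while leaving the values equal.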

</details>

## Intended Use

- offline image understanding
- offline video understanding
- multimodal prompt experiments for release validation
- checkpoint-level inference integration and debugging

## Requirements

Core validated inference dependencies:

- `python==3.12.13`
- `torch==2.8.0+cu128`
- `torchvision==0.23.0+cu128`
- `transformers==4.57.1`
- `accelerate==1.12.0`
- `flash_attn==2.8.1`
- `torchcodec==0.7.0`
- `numpy==2.4.3`
- `pillow==12.1.1`
- `joblib==1.5.2`
- `einops==0.8.2`

Installation commands:

```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```
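
To check whether an environment matches the validated versions listed above, the installed distributions can be compared against the pins. The sketch below uses `importlib.metadata` from the standard library. `PINNED` copies a subset of the list, the helper names are hypothetical, and it assumes the distribution names match the names shown above.

```python
from importlib import metadata

# Subset of the pins listed above; extend as needed.
PINNED = {
    "torch": "2.8.0+cu128",
    "transformers": "4.57.1",
    "accelerate": "1.12.0",
    "flash_attn": "2.8.1",
}


def installed_version(name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map package name -> (expected, found) for every pin that differs."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


for name, (expected, found) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build with a different local suffix (for example a CPU-only `torch`) is reported as a mismatch.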

Validated setup notes:

- CUDA runtime used for validation: `12.8`
- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`


## Limitations and Future Work

- realtime usage is not documented here
- benchmark, metric, and training details are still blank
- some sections are intentionally placeholders until release information is finalized
- batch calls currently require shared `generate_kwargs` and shared `media_kwargs` within one call
- batch streaming and batch cancel / stop protocol are not part of `offline_batch_generate(...)`
- the queue example is intentionally minimal and does not include production-grade timeout or worker error handling


## Citation
```bibtex
@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note          = {GitHub repository}
}
```