	---
	title: MOSS-VL-SFT-0408
	date: 2026-04-08
	category: Multimodal-LLM
	status: SFT
	language:
	- en
	library_name: transformers
	pipeline_tag: video-text-to-text
	license: apache-2.0
	base_model: fnlp-vision/moss-video-preview-base
	tags:
	- SFT
	- Video-Understanding
	- Image-Understanding
	- MOSS-VL
	- OpenMOSS
	- multimodal
	- video
	- vision-language
	---

	<p align="center">
	<img src="assets/logo.png" width="320"/>
	</p>

	# MOSS-VL-SFT-0408

	## 📌 Introduction

	We introduce MOSS-VL-SFT-0408, the supervised fine-tuned checkpoint in the MOSS-VL series (part of the OpenMOSS ecosystem).

	> [!IMPORTANT]
	> This is an SFT checkpoint (instruction-tuned). It is NOT the Real-Time SFT streaming checkpoint.

	This model is designed as a high-performance offline engine for multimodal tasks, bridging the gap between static image understanding and dynamic real-time interaction.

	### This checkpoint is intended for:

	- Video and image understanding with improved instruction-following capabilities.
	- Serving as a strong starting point for further Real-Time SFT or domain-specific adaptation.

	---

	## 🚀 Key Features & Status

	| Feature | Status | Description |
	| :--- | :---: | :--- |
	| Model Loading | ✅ | Standard HF loading with `trust_remote_code=True` |
	| Image Understanding | ✅ | Single/Multi-image input support |
	| Video Understanding | ✅ | Native video frame sequence processing |
	| Mixed Inference | ✅ | Interleaved image and video inputs |
	| Offline Generation | ✅ | Optimized `offline_generate` & `offline_batch_generate` |
	| Benchmarks/Metrics | ⏳ | Coming in future updates |

	---

	## 🏗 Model Architecture

	MOSS-VL-SFT-0408 adopts a decoupled multimodal design, utilizing a cross-attention mechanism to bridge high-resolution visual encoding with advanced language reasoning.

	<p align="center">
	<img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
	<br>
	<em>Figure 1: MOSS-VL Core Architecture.</em>
	</p>


	## Temporal-Aware Prompting

	At the model-family level, MOSS-VL uses timestamp-aware multimodal prompting for video understanding. This design gives sampled frames explicit temporal anchors, which helps the model reason about order, duration, and event localization more robustly.

	<p align="center">
	<img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
	<br>
	<em>Figure 2: Illustration of the timestamped sequence input pipeline.</em>
	</p>
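
To make the idea concrete, here is a minimal sketch of how frame timestamps could be derived from a sampling rate, clamped to a frame budget, and interleaved with frame placeholders. The function names, the `<frame>` placeholder, and the `[0.50s]` timestamp format are illustrative assumptions, not the checkpoint's real prompt layout, which is defined by its processor. The parameter names mirror `video_fps`, `video_minlen`, and `video_maxlen` from the Quickstart, but their exact semantics in the processor may differ.

```python
def sample_timestamps(duration_s, video_fps=1.0, video_minlen=8, video_maxlen=256):
    """Return evenly spaced timestamps (seconds) for sampled frames.

    Assumption: the frame count is duration * fps, clamped to
    [video_minlen, video_maxlen]. The real processor may differ.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    num_frames = int(round(duration_s * video_fps))
    num_frames = max(video_minlen, min(video_maxlen, num_frames))
    step = duration_s / num_frames
    # Sample at the midpoint of each of the num_frames equal segments.
    return [round((i + 0.5) * step, 2) for i in range(num_frames)]


def build_timestamped_prompt(timestamps, question):
    """Interleave a hypothetical timestamp tag with a frame placeholder."""
    parts = [f"[{t:.2f}s]<frame>" for t in timestamps]
    return "".join(parts) + "\n" + question


ts = sample_timestamps(duration_s=4.0, video_fps=1.0, video_minlen=2)
print(build_timestamped_prompt(ts, "When does the door open?"))
```

With explicit anchors like these in the sequence, a question about when something happens can be answered by pointing at a timestamp rather than at a frame index.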

	## Multimodal RoPE

	MOSS-VL uses multimodal rotary position encoding to align text tokens and visual features in a shared spatial-temporal coordinate system. At a high level, this improves video-text grounding and helps preserve temporal structure during multimodal reasoning.

	<p align="center">
	<img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
	<br>
	<em>Figure 3: 3D-RoPE spatial-temporal alignment.</em>
	</p>
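
As a rough illustration of a shared spatial-temporal coordinate system, the sketch below assigns each token a `(t, h, w)` position triple. Text tokens advance all three axes together, while the patches of one frame share a temporal index and vary along height and width. This follows the general multimodal RoPE idea and is an assumption made for illustration, not MOSS-VL's actual position-id code; `build_position_ids` and the segment format are hypothetical.

```python
def build_position_ids(segments):
    """Assign a (t, h, w) triple to every token.

    segments: list of ("text", n_tokens) or ("frame", grid_h, grid_w).
    Illustrative scheme only; the model's real indexing may differ.
    """
    positions = []
    next_pos = 0  # first free index along every axis
    for seg in segments:
        if seg[0] == "text":
            for _ in range(seg[1]):
                # Text behaves like 1D RoPE: same index on all axes.
                positions.append((next_pos, next_pos, next_pos))
                next_pos += 1
        elif seg[0] == "frame":
            _, grid_h, grid_w = seg
            for h in range(grid_h):
                for w in range(grid_w):
                    # One temporal index per frame, 2D offsets per patch.
                    positions.append((next_pos, next_pos + h, next_pos + w))
            # Continue after the largest index used by this frame.
            next_pos += max(grid_h, grid_w)
        else:
            raise ValueError(f"unknown segment type: {seg[0]}")
    return positions


ids = build_position_ids([("text", 2), ("frame", 2, 2), ("text", 1)])
for triple in ids:
    print(triple)
```

In this toy scheme, two text tokens, one 2x2 frame, and one trailing text token yield seven triples, and the four patches of the frame all share the temporal index 2.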




	## 🚀 Quickstart

	<details>
	<summary><strong>Queue-based offline inference (Python)</strong></summary>

	<br>

	```python
	import os
	import queue
	import threading

	import torch
	from transformers import AutoModelForCausalLM, AutoProcessor

	checkpoint = "path/to/checkpoint"
	video_path = "data/example_video.mp4"
	prompt = "Describe the video."

	max_new_tokens = 1024
	temperature = 1.0
	top_k = 50
	top_p = 1.0
	repetition_penalty = 1.0

	video_fps = 1.0
	video_minlen = 8
	video_maxlen = 256


	def load_model(checkpoint: str):
	    processor = AutoProcessor.from_pretrained(
	        checkpoint,
	        trust_remote_code=True,
	        frame_extract_num_threads=1,
	    )
	    model = AutoModelForCausalLM.from_pretrained(
	        checkpoint,
	        trust_remote_code=True,
	        device_map="auto",
	        torch_dtype=torch.bfloat16,
	        attn_implementation="flash_attention_2",
	    )
	    return model, processor


	if not checkpoint:
	    raise ValueError("Missing `checkpoint`.")
	if not video_path:
	    raise ValueError("Missing `video_path`.")
	if not os.path.isfile(video_path):
	    raise FileNotFoundError(f"Video not found: {video_path}")

	model, processor = load_model(checkpoint)
	new_queries: "queue.Queue[dict]" = queue.Queue()
	output_text_queue: "queue.Queue[str]" = queue.Queue()

	query = {
	    "prompt": prompt,
	    "images": [],
	    "videos": [video_path],
	    "media_kwargs": {
	        "video_fps": video_fps,
	        "video_minlen": video_minlen,
	        "video_maxlen": video_maxlen,
	    },
	    "generate_kwargs": {
	        "temperature": temperature,
	        "top_k": top_k,
	        "top_p": top_p,
	        "max_new_tokens": max_new_tokens,
	        "repetition_penalty": repetition_penalty,
	        "do_sample": False,
	    },
	}


	def drain_output():
	    # Print streamed text until the end-of-round marker arrives.
	    while True:
	        tok = output_text_queue.get()
	        if tok == "<|round_end|>":
	            break
	        print(tok, end="", flush=True)


	worker = threading.Thread(
	    target=model.offline_generate,
	    args=(processor, new_queries, output_text_queue),
	    kwargs={"vision_chunked_length": 64},
	    daemon=True,
	)
	worker.start()

	new_queries.put(query)
	drain_output()

	new_queries.put({"stop_offline_generate": True})
	worker.join(timeout=5.0)
	```

	For image-only usage, keep the same template and change:

	- replace `video_path` with `image_path`
	- validate `image_path` instead of `video_path`
	- set `images` to `[image_path]`
	- set `videos` to `[]`
	- remove `media_kwargs` if you do not need video-specific controls
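
Applied to the query dictionary, those changes could look like the sketch below. `build_image_query`, its `check_exists` flag, the default `generate_kwargs`, and the example path are hypothetical; the dictionary keys are the ones used in the video example above.

```python
import os


def build_image_query(image_path, prompt, generate_kwargs=None, check_exists=True):
    """Build an image-only query dict using the keys from the video example."""
    if not image_path:
        raise ValueError("Missing `image_path`.")
    if check_exists and not os.path.isfile(image_path):
        raise FileNotFoundError(f"Image not found: {image_path}")
    # Hypothetical defaults, used only when the caller passes nothing.
    defaults = {"max_new_tokens": 1024, "do_sample": False}
    return {
        "prompt": prompt,
        "images": [image_path],  # the image input replaces the video input
        "videos": [],
        # `media_kwargs` is omitted: no video-specific controls are needed.
        "generate_kwargs": dict(generate_kwargs or defaults),
    }


# `check_exists=False` only so the snippet runs without a real file.
query = build_image_query("data/example_image.jpg", "Describe the image.", check_exists=False)
print(query["images"], query["videos"])
```

The resulting dictionary can be put on `new_queries` in place of the video query; the rest of the queue template stays unchanged.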

	</details>

	<details>
	<summary><strong>Batched offline inference (Python)</strong></summary>

	<br>

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoProcessor

	checkpoint = "path/to/checkpoint"

	shared_generate_kwargs = {
	    "temperature": 1.0,
	    "top_k": 50,
	    "top_p": 1.0,
	    "max_new_tokens": 256,
	    "repetition_penalty": 1.0,
	    "do_sample": False,
	}

	shared_media_kwargs = {
	    "video_fps": 1.0,
	    "video_minlen": 8,
	    "video_maxlen": 256,
	}


	def load_model(checkpoint: str):
	    processor = AutoProcessor.from_pretrained(
	        checkpoint,
	        trust_remote_code=True,
	        frame_extract_num_threads=1,
	    )
	    model = AutoModelForCausalLM.from_pretrained(
	        checkpoint,
	        trust_remote_code=True,
	        device_map="auto",
	        torch_dtype=torch.bfloat16,
	        attn_implementation="flash_attention_2",
	    )
	    return model, processor


	model, processor = load_model(checkpoint)
	# Each query copies the shared kwargs so that later edits stay per-query.
	queries = [
	    {
	        "prompt": "Describe sample A.",
	        "images": [],
	        "videos": ["data/sample_a.mp4"],
	        "media_kwargs": dict(shared_media_kwargs),
	        "generate_kwargs": dict(shared_generate_kwargs),
	    },
	    {
	        "prompt": "Describe sample B.",
	        "images": [],
	        "videos": ["data/sample_b.mp4"],
	        "media_kwargs": dict(shared_media_kwargs),
	        "generate_kwargs": dict(shared_generate_kwargs),
	    },
	]

	with torch.no_grad():
	    result = model.offline_batch_generate(
	        processor,
	        queries,
	        session_states=None,
	        vision_chunked_length=64,
	    )

	texts = [item["text"] for item in result["results"]]
	session_states = result["session_states"]
	```

	To continue the conversation, pass the returned `session_states` back into the next call. In the follow-up below, the second query sets `reset_session` to restart that sample's session.

	```python
	followup_queries = [
	    {
	        "prompt": "Summarize sample A in one sentence.",
	        "generate_kwargs": dict(shared_generate_kwargs),
	    },
	    {
	        "prompt": "Restart sample B and answer again.",
	        "reset_session": True,
	        "generate_kwargs": dict(shared_generate_kwargs),
	    },
	]

	with torch.no_grad():
	    followup_result = model.offline_batch_generate(
	        processor,
	        followup_queries,
	        session_states=session_states,
	        vision_chunked_length=64,
	    )
	```
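
Because a single `offline_batch_generate(...)` call expects every query to share the same `generate_kwargs` and `media_kwargs` (see Limitations), a mixed workload has to be split before batching. The sketch below shows one way to do that with the standard library only. `group_by_shared_kwargs` is a hypothetical helper, not part of the checkpoint's code.

```python
import json
from collections import defaultdict


def group_by_shared_kwargs(queries):
    """Split queries into groups whose generate/media kwargs are identical."""
    groups = defaultdict(list)
    for index, query in enumerate(queries):
        # Serialize with sorted keys so dict ordering does not matter.
        key = json.dumps(
            [query.get("generate_kwargs", {}), query.get("media_kwargs", {})],
            sort_keys=True,
        )
        # Keep the original index so results can be re-ordered afterwards.
        groups[key].append((index, query))
    return list(groups.values())


queries = [
    {"prompt": "A", "generate_kwargs": {"max_new_tokens": 256}},
    {"prompt": "B", "generate_kwargs": {"max_new_tokens": 64}},
    {"prompt": "C", "generate_kwargs": {"max_new_tokens": 256}},
]
for group in group_by_shared_kwargs(queries):
    print([index for index, _ in group])
```

Each group could then be sent in its own batch call, with the stored indices used to put the returned texts back into the original order.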

	</details>

	## Intended Use

	- offline image understanding
	- offline video understanding
	- multimodal prompt experiments for release validation
	- checkpoint-level inference integration and debugging

	## Requirements

	Core validated inference dependencies:

	- `python==3.12.13`
	- `torch==2.8.0+cu128`
	- `torchvision==0.23.0+cu128`
	- `transformers==4.57.1`
	- `accelerate==1.12.0`
	- `flash_attn==2.8.1`
	- `torchcodec==0.7.0`
	- `numpy==2.4.3`
	- `pillow==12.1.1`
	- `joblib==1.5.2`
	- `einops==0.8.2`

	Installation commands:

	```bash
	conda create -n moss_vl python=3.12 pip -y
	conda activate moss_vl
	pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
	```

	Validated setup notes:

	- CUDA runtime used for validation: `12.8`
	- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`
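
After installation, a quick comparison of the installed versions against the pinned list above can catch environment drift early. The sketch below uses only `importlib.metadata` from the standard library; `PINNED`, `installed_version`, and `find_mismatches` are hypothetical names, and the Python interpreter version itself is not checked.

```python
from importlib import metadata

# Versions copied from the validated dependency list.
PINNED = {
    "torch": "2.8.0+cu128",
    "torchvision": "0.23.0+cu128",
    "transformers": "4.57.1",
    "accelerate": "1.12.0",
    "flash_attn": "2.8.1",
    "torchcodec": "0.7.0",
    "numpy": "2.4.3",
    "pillow": "12.1.1",
    "joblib": "1.5.2",
    "einops": "0.8.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map package name -> (expected, found) for every deviation."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


for name, (expected, found) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build of `torch` without the `+cu128` local version suffix is reported as a mismatch even when the base version agrees.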


	## Limitations and Future Work

	- real-time usage is not documented here
	- benchmark, metric, and training details are not yet included
	- some sections are intentionally placeholders until release information is finalized
	- batch calls currently require shared `generate_kwargs` and shared `media_kwargs` within one call
	- batch streaming and batch cancel / stop protocol are not part of `offline_batch_generate(...)`
	- the queue example is intentionally minimal and does not include production-grade timeout or worker error handling
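
For the last point, one possible hardening of the output loop is sketched below: it polls the queue with a timeout, gives up after a period without output, and notices when the worker thread has died. `drain_with_timeout`, its parameters, and the stand-in `fake_worker` are hypothetical; only the `<|round_end|>` marker comes from the queue example.

```python
import queue
import threading
import time

END_TOKEN = "<|round_end|>"  # end-of-round marker used in the queue example


def drain_with_timeout(output_queue, worker=None, idle_timeout=30.0, poll=0.5):
    """Collect streamed text until END_TOKEN, a stall, or worker death."""
    pieces = []
    last_activity = time.monotonic()
    while True:
        try:
            tok = output_queue.get(timeout=poll)
        except queue.Empty:
            # A dead worker with nothing left to read will never finish.
            if worker is not None and not worker.is_alive() and output_queue.empty():
                raise RuntimeError("worker thread exited before finishing")
            if time.monotonic() - last_activity > idle_timeout:
                raise TimeoutError(f"no output for {idle_timeout} s")
            continue
        if tok == END_TOKEN:
            return "".join(pieces)
        pieces.append(tok)
        last_activity = time.monotonic()


# Stand-in producer so the sketch runs without a model.
def fake_worker(out):
    for tok in ["Hello", ", ", "world", END_TOKEN]:
        out.put(tok)


out_q = queue.Queue()
t = threading.Thread(target=fake_worker, args=(out_q,), daemon=True)
t.start()
print(drain_with_timeout(out_q, worker=t, idle_timeout=2.0, poll=0.1))
```

In the queue example, the same function could replace `drain_output()` by passing `output_text_queue` and the `worker` thread.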


	## Citation

	```bibtex
	@misc{moss_vl_2026,
	  title = {{MOSS-VL Technical Report}},
	  author = {OpenMOSS Team},
	  year = {2026},
	  howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
	  note = {GitHub repository}
	}
	```