---
title: MOSS-VL-SFT-0408
date: 2026-04-08
category: Multimodal-LLM
status: SFT
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
- SFT
- Video-Understanding
- Image-Understanding
- MOSS-VL
- OpenMOSS
- multimodal
- video
- vision-language
---

<p align="center">
  <img src="assets/logo.png" width="320"/>
</p>

# MOSS-VL-SFT-0408

## 📌 Introduction

We introduce **MOSS-VL-SFT-0408**, the supervised fine-tuned checkpoint in the **MOSS-VL** series (part of the **OpenMOSS** ecosystem).

> [!IMPORTANT]
> This is an **SFT** checkpoint (instruction-tuned). It is **NOT** the Real-Time SFT streaming checkpoint.

This model is designed as a high-performance offline engine for multimodal tasks, bridging static image understanding and dynamic real-time interaction.

### This checkpoint is intended for:

- **Video and image understanding** with significantly improved instruction-following capabilities.
- Serving as a **strong starting point** for further **Real-Time SFT** or domain-specific adaptation.

---

## 🚀 Key Features & Status

| Feature | Status | Description |
| :--- | :---: | :--- |
| **Model Loading** | ✅ | Standard HF loading with `trust_remote_code=True` |
| **Image Understanding** | ✅ | Single/Multi-image input support |
| **Video Understanding** | ✅ | Native video frame sequence processing |
| **Mixed Inference** | ✅ | Interleaved image and video inputs |
| **Offline Generation** | ✅ | Optimized `offline_generate` & `offline_batch_generate` |
| **Benchmarks/Metrics** | ⏳ | Coming in future updates |

---

## 🏗 Model Architecture

**MOSS-VL-SFT-0408** adopts a decoupled multimodal design: a cross-attention mechanism connects high-resolution visual encoding with language reasoning.

<p align="center">
  <img src="assets/structure.png" alt="MOSS-VL Architecture" width="90%"/>
  <br>
  <em>Figure 1: MOSS-VL Core Architecture.</em>
</p>
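
To make the cross-attention idea concrete, here is a minimal single-head sketch in plain Python in which text-side queries attend over visual keys and values. It is a generic illustration, not the MOSS-VL implementation: learned projections, multiple heads, and the actual layer layout are omitted, and the function names `softmax` and `cross_attention` are chosen for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_queries, visual_keys, visual_values):
    """Single-head cross-attention: each text query attends over visual tokens.

    Generic illustration only; this is not the MOSS-VL layer.
    """
    dim = len(visual_keys[0])
    outputs = []
    for q in text_queries:
        # Scaled dot-product score between this text query and every visual key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in visual_keys
        ]
        weights = softmax(scores)
        # The weighted sum of visual values is the visual context for this query.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, visual_values))
            for i in range(len(visual_values[0]))
        ]
        outputs.append(mixed)
    return outputs
```

Each output row has the width of the visual values, so the language side receives one visual context vector per text position without the visual tokens having to sit in the text sequence itself.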

## Temporal-Aware Prompting

At the model-family level, MOSS-VL uses timestamp-aware multimodal prompting for video understanding. This design gives sampled frames explicit temporal anchors, which helps the model reason about order, duration, and event localization more robustly.

<p align="center">
  <img src="assets/timestamp_input.svg" alt="Timestamped Sequence Input Illustration" width="90%"/>
  <br>
  <em>Figure 2: Illustration of the timestamped sequence input pipeline.</em>
</p>
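
The sketch below shows one way sampled frames could be paired with explicit timestamps. It rests on assumptions that this card does not confirm: the frame count is taken as duration times fps, clamped to a minimum and maximum (loosely mirroring the `video_fps`, `video_minlen`, and `video_maxlen` controls), and the `<t=...s>` marker and `<frame>` placeholder are hypothetical formats invented for illustration. The real processor may sample and tokenize differently.

```python
def sample_timestamps(duration_s, fps=1.0, min_frames=8, max_frames=256):
    """Return evenly spaced sample times (in seconds) for a clip.

    Assumption: frame count = duration * fps, clamped to
    [min_frames, max_frames]. The actual processor may behave differently.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    count = int(round(duration_s * fps))
    count = max(min_frames, min(max_frames, count))
    step = duration_s / count
    # Sample at the midpoint of each of `count` equal-length segments.
    return [round((i + 0.5) * step, 3) for i in range(count)]


def interleave_with_timestamps(timestamps, frame_token="<frame>"):
    """Place a timestamp marker in front of each frame placeholder.

    Both the "<t=...s>" marker and the "<frame>" token are hypothetical.
    """
    return "".join(f"<t={t:.1f}s>{frame_token}" for t in timestamps)
```

With this scheme, a short clip is padded up to the minimum frame count and a long clip is capped at the maximum, while every frame still carries the time at which it was sampled.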

## Multimodal RoPE

MOSS-VL uses multimodal rotary position encoding to align text tokens and visual features in a shared spatial-temporal coordinate system. At a high level, this improves video-text grounding and helps preserve temporal structure during multimodal reasoning.

<p align="center">
  <img src="assets/3d-rope.png" alt="MOSS-VL mRoPE Architecture Illustration" width="80%"/>
  <br>
  <em>Figure 3: 3D-RoPE spatial-temporal alignment.</em>
</p>
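
As a rough illustration of a shared spatial-temporal coordinate system, the following sketch assigns a `(t, h, w)` index triple to every token. It follows a common multimodal RoPE indexing pattern (text advances all three axes together, visual tokens vary per frame, row, and column) and is an assumption about the general technique, not a description of MOSS-VL's exact scheme. The segment format and function names are hypothetical.

```python
def text_positions(start, length):
    """Text tokens advance all three axes together: (t, h, w) = (p, p, p)."""
    return [(start + i, start + i, start + i) for i in range(length)]


def video_positions(start, num_frames, grid_h, grid_w):
    """Visual tokens get a distinct index per frame, row, and column."""
    return [
        (start + f, start + r, start + c)
        for f in range(num_frames)
        for r in range(grid_h)
        for c in range(grid_w)
    ]


def build_position_ids(segments):
    """Assign (t, h, w) ids to a sequence of text and video segments.

    segments: list of ("text", length) or ("video", num_frames, grid_h, grid_w).
    Each segment starts one past the largest index used so far.
    """
    ids, next_start = [], 0
    for seg in segments:
        if seg[0] == "text":
            new = text_positions(next_start, seg[1])
        elif seg[0] == "video":
            new = video_positions(next_start, *seg[1:])
        else:
            raise ValueError(f"unknown segment kind: {seg[0]}")
        ids.extend(new)
        if new:
            next_start = max(max(p) for p in new) + 1
    return ids
```

For example, `build_position_ids([("text", 2), ("video", 2, 1, 2), ("text", 1)])` yields seven triples in which the two frames differ only along the temporal axis and the trailing text token resumes after the largest visual index.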

## 🚀 Quickstart

<details>
<summary><strong>Queue-based offline inference (Python)</strong></summary>
<br>

```python
import os
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

# Generation settings
max_new_tokens = 1024
temperature = 1.0
top_k = 50
top_p = 1.0
repetition_penalty = 1.0

# Video sampling settings
video_fps = 1.0
video_minlen = 8
video_maxlen = 256


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


# Validate inputs before loading the model.
if not checkpoint:
    raise ValueError("Missing `checkpoint`.")
if not video_path:
    raise ValueError("Missing `video_path`.")
if not os.path.isfile(video_path):
    raise FileNotFoundError(f"Video not found: {video_path}")

model, processor = load_model(checkpoint)

# Queries go in through one queue; generated text comes back through the other.
new_queries: "queue.Queue[dict]" = queue.Queue()
output_text_queue: "queue.Queue[str]" = queue.Queue()

query = {
    "prompt": prompt,
    "images": [],
    "videos": [video_path],
    "media_kwargs": {
        "video_fps": video_fps,
        "video_minlen": video_minlen,
        "video_maxlen": video_maxlen,
    },
    "generate_kwargs": {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
        "do_sample": False,
    },
}


def drain_output():
    # Print streamed text until the end-of-round marker arrives.
    while True:
        tok = output_text_queue.get()
        if tok == "<|round_end|>":
            break
        print(tok, end="", flush=True)


# Run generation in a background thread.
worker = threading.Thread(
    target=model.offline_generate,
    args=(processor, new_queries, output_text_queue),
    kwargs={"vision_chunked_length": 64},
    daemon=True,
)
worker.start()

new_queries.put(query)
drain_output()

# Ask the worker to stop, then wait briefly for it to exit.
new_queries.put({"stop_offline_generate": True})
worker.join(timeout=5.0)
```

For image-only usage, keep the same template and change:

- replace `video_path` with `image_path`
- validate `image_path` instead of `video_path`
- set `images` to `[image_path]`
- set `videos` to `[]`
- remove `media_kwargs` if you do not need video-specific controls
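
Applied to the query dictionary, the changes listed above could look like the helper below. `build_image_query` and its `check_exists` flag are hypothetical names introduced for this sketch; only the dictionary keys come from the template.

```python
import os


def build_image_query(image_path, prompt, generate_kwargs, check_exists=True):
    """Return an image-only query dict that mirrors the video template.

    Hypothetical helper: `media_kwargs` is left out because it only carries
    video-specific controls in the template.
    """
    if not image_path:
        raise ValueError("Missing `image_path`.")
    if check_exists and not os.path.isfile(image_path):
        raise FileNotFoundError(f"Image not found: {image_path}")
    return {
        "prompt": prompt,
        "images": [image_path],
        "videos": [],
        "generate_kwargs": dict(generate_kwargs),
    }
```

The returned dictionary can be put on `new_queries` in place of the video query, with the rest of the queue-based flow unchanged.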

</details>

<details>
<summary><strong>Batched offline inference (Python)</strong></summary>
<br>

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

# All queries in one batch call share these settings.
shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}
shared_media_kwargs = {
    "video_fps": 1.0,
    "video_minlen": 8,
    "video_maxlen": 256,
}


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)

queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": dict(shared_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

# First round: no previous session state.
with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
session_states = result["session_states"]
```

```python
# Follow-up round: pass the session states returned by the first call.
followup_queries = [
    {
        "prompt": "Summarize sample A in one sentence.",
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "prompt": "Restart sample B and answer again.",
        "reset_session": True,
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    followup_result = model.offline_batch_generate(
        processor,
        followup_queries,
        session_states=session_states,
        vision_chunked_length=64,
    )
```
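
Because a single batched call expects every query to share the same `generate_kwargs` and `media_kwargs` (see Limitations), a small pre-flight check can catch mismatches before inference starts. The helper below is a hypothetical sketch, not part of the model's API. It assumes that a missing key counts as equal only to another missing key.

```python
def assert_shared_kwargs(queries, keys=("generate_kwargs", "media_kwargs")):
    """Raise ValueError if queries in one batch disagree on any shared kwargs.

    Hypothetical helper; it compares every query against the first one.
    """
    if not queries:
        return
    for key in keys:
        reference = queries[0].get(key)
        for index, query in enumerate(queries[1:], start=1):
            if query.get(key) != reference:
                raise ValueError(
                    f"Query {index} has different `{key}` than query 0; "
                    "split these into separate batch calls."
                )
```

Calling `assert_shared_kwargs(queries)` right before `offline_batch_generate(...)` keeps the failure close to the place where the queries are built.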

</details>

## Intended Use

- offline image understanding
- offline video understanding
- multimodal prompt experiments for release validation
- checkpoint-level inference integration and debugging

## Requirements

Core validated inference dependencies:

- `python==3.12.13`
- `torch==2.8.0+cu128`
- `torchvision==0.23.0+cu128`
- `transformers==4.57.1`
- `accelerate==1.12.0`
- `flash_attn==2.8.1`
- `torchcodec==0.7.0`
- `numpy==2.4.3`
- `pillow==12.1.1`
- `joblib==1.5.2`
- `einops==0.8.2`

Installation commands:

```bash
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```

Validated setup notes:

- CUDA runtime used for validation: `12.8`
- Inference loading uses `trust_remote_code=True` and `attn_implementation="flash_attention_2"`
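
To check whether an environment matches the validated pins, a short script along the following lines may help. It relies only on the standard library's `importlib.metadata`; the `VALIDATED_PINS` mapping copies the package list above, while `installed_version` and `find_mismatches` are names made up for this sketch. It assumes the installed distribution names match the pin names, and it compares version strings exactly, so a build with a different local suffix (for example a different CUDA tag) is reported as a mismatch.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the validated dependency list (Python itself excluded).
VALIDATED_PINS = {
    "torch": "2.8.0+cu128",
    "torchvision": "0.23.0+cu128",
    "transformers": "4.57.1",
    "accelerate": "1.12.0",
    "flash_attn": "2.8.1",
    "torchcodec": "0.7.0",
    "numpy": "2.4.3",
    "pillow": "12.1.1",
    "joblib": "1.5.2",
    "einops": "0.8.2",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map package name -> (expected, found) for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(VALIDATED_PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

A mismatch does not necessarily mean inference will fail; it only means the environment differs from the one used for validation.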

## Limitations and Future Work

- real-time usage is not documented here
- benchmark, metric, and training details have not been published yet
- some sections are intentionally placeholders until release information is finalized
- batch calls currently require shared `generate_kwargs` and shared `media_kwargs` within one call
- batch streaming and a batch cancel/stop protocol are not part of `offline_batch_generate(...)`
- the queue example is intentionally minimal and does not include production-grade timeout or worker error handling
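
As one possible way to harden the queue example, the reader loop can poll with a timeout and watch the worker thread instead of blocking forever. This is a sketch under assumptions: the worker is the only producer on the output queue, the `<|round_end|>` marker ends a round as in the quickstart, and `drain_with_timeout` with its parameters is a hypothetical helper rather than part of the model's API.

```python
import queue
import threading
import time


def drain_with_timeout(output_queue, worker, end_token="<|round_end|>",
                       idle_timeout=30.0, poll_interval=0.1):
    """Collect streamed text until end_token, a dead worker, or an idle timeout.

    Returns (text, finished); finished is True only if end_token arrived.
    """
    pieces = []
    last_activity = time.monotonic()
    while True:
        # Check liveness before waiting: if the worker is already dead and the
        # queue stays empty, no further output can arrive.
        alive = worker.is_alive()
        try:
            tok = output_queue.get(timeout=poll_interval)
        except queue.Empty:
            idle = time.monotonic() - last_activity
            if not alive or idle > idle_timeout:
                return "".join(pieces), False
            continue
        last_activity = time.monotonic()
        if tok == end_token:
            return "".join(pieces), True
        pieces.append(tok)
```

In the quickstart, `drain_with_timeout(output_text_queue, worker)` could replace `drain_output()`; a `False` second element signals that the round ended without the end marker and the partial text should be treated accordingly.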

## Citation

```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {OpenMOSS Team},
  year         = {2026},
  howpublished = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note         = {GitHub repository}
}
```