metadata
title: MOSS-VL-Base-0408
date: 2026-04-08T00:00:00.000Z
category: Multimodal-LLM
status: Base
language:
  - en
library_name: transformers
pipeline_tag: video-text-to-text
license: apache-2.0
base_model: fnlp-vision/moss-video-preview-base
tags:
  - Base
  - Video-Understanding
  - Image-Understanding
  - MOSS-VL
  - OpenMOSS
  - multimodal
  - video
  - vision-language

MOSS-VL-Base-0408

📌 Introduction

We introduce MOSS-VL-Base-0408, the base checkpoint in the MOSS-VL series (part of the OpenMOSS ecosystem).

This is a base checkpoint. It has NOT undergone supervised fine-tuning (SFT) or instruction tuning.

This model is trained through four stages of pretraining only:

  1. Stage 1: Vision-language alignment
  2. Stage 2: Large-scale multimodal pretraining
  3. Stage 3: High-quality multimodal pretraining
  4. Stage 4: Annealing and long-context extension

This model is designed as a high-performance offline engine for multimodal tasks and as a foundation for downstream adaptation.

This checkpoint is intended for:

  • Video and image understanding, and general multimodal representation learning.
  • Serving as a starting point for future SFT, alignment, or domain-specific adaptation.

🚀 Key Features & Status

| Feature | Status | Description |
|---|---|---|
| Model Loading | ✅ | Standard HF loading with trust_remote_code=True |
| Image Understanding | ✅ | Single/Multi-image input support |
| Video Understanding | ✅ | Native video frame sequence processing |
| Mixed Inference | ✅ | Interleaved image and video inputs |
| Offline Generation | ✅ | Optimized offline_generate & offline_batch_generate |
| Benchmarks/Metrics | ⏳ | Coming in future updates |

🏗 Model Architecture

MOSS-VL-Base-0408 adopts a decoupled multimodal design: a cross-attention mechanism bridges high-resolution visual encoding and language modeling.

Figure 1: MOSS-VL Core Architecture.
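
As a rough illustration of the cross-attention idea only (this is not the MOSS-VL implementation, whose layer layout, head count, and projections are not described in this card), the sketch below lets each text-token vector attend over a set of visual feature vectors using single-head scaled dot-product attention. The function names and the tiny two-dimensional vectors are hypothetical, and learned query/key/value projections are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_states):
    """Single-head scaled dot-product cross-attention.

    Queries come from text_states; keys and values both come from
    visual_states. Assumption: identity projections, to keep this short.
    """
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in text_states:
        # Similarity of this text token to every visual vector.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in visual_states
        ]
        weights = softmax(scores)
        # Attention output: weighted average of the visual vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, visual_states))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


text = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(text, visual)
```

Each row of `fused` is a convex combination of the visual vectors, weighted toward the ones most similar to the corresponding text vector.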

Temporal-Aware Prompting

At the model-family level, MOSS-VL uses timestamp-aware multimodal prompting for video understanding. This design gives sampled frames explicit temporal anchors, which helps the model reason about order, duration, and event localization more robustly.

Figure 2: Illustration of the timestamped sequence input pipeline.
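
To make the idea of temporal anchors concrete, here is a minimal sketch of how frame timestamps could be sampled and paired with frame placeholders. It rests on stated assumptions: the frame count is taken to be duration × fps clamped between a minimum and maximum (mirroring the video_fps, video_minlen, and video_maxlen settings in the Quickstart, whose exact semantics are not documented here), and the `[1.5s]<frame>` tag format is invented for illustration. The actual prompt layout is produced by the model's processor.

```python
def sample_timestamps(duration_s, fps=1.0, min_frames=8, max_frames=256):
    """Return evenly spaced frame timestamps (in seconds) for a clip.

    Assumption: target count = duration * fps, clamped to
    [min_frames, max_frames].
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    target = int(round(duration_s * fps))
    count = max(min_frames, min(max_frames, target))
    step = duration_s / count
    # Sample at the centre of each of the `count` equal intervals.
    return [round((i + 0.5) * step, 3) for i in range(count)]


def timestamped_prompt(timestamps, question):
    """Prefix each (hypothetical) frame placeholder with its time tag."""
    parts = [f"[{t:.1f}s]<frame>" for t in timestamps]
    return "".join(parts) + "\n" + question


stamps = sample_timestamps(duration_s=4.0, fps=1.0, min_frames=2, max_frames=8)
print(timestamped_prompt(stamps, "Describe the video."))
```

With the clamp in place, a very short clip is still sampled at `min_frames` timestamps and a very long one never exceeds `max_frames`.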

Multimodal RoPE

MOSS-VL uses multimodal rotary position encoding to align text tokens and visual features in a shared spatial-temporal coordinate system. At a high level, this improves video-text grounding and helps preserve temporal structure during multimodal reasoning.

Figure 3: 3D-RoPE spatial-temporal alignment.
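
The shared coordinate system can be pictured as a (time, height, width) triple per token. The sketch below follows one common convention for multimodal RoPE position ids: text tokens advance all three axes together, while visual tokens vary each axis independently within their block. Whether MOSS-VL assigns ids exactly this way is an assumption; the function name and segment format are hypothetical.

```python
def mrope_position_ids(segments):
    """Build (t, h, w) position triples for a mixed token sequence.

    segments is a list of ("text", n_tokens) or ("vision", (T, H, W)),
    where T is the number of frames and H x W the patch grid per frame.
    """
    positions = []
    next_pos = 0
    for kind, spec in segments:
        if kind == "text":
            # Text tokens use the same index on all three axes.
            for i in range(spec):
                p = next_pos + i
                positions.append((p, p, p))
            next_pos += spec
        elif kind == "vision":
            frames, height, width = spec
            for t in range(frames):
                for h in range(height):
                    for w in range(width):
                        positions.append(
                            (next_pos + t, next_pos + h, next_pos + w)
                        )
            # Later tokens start after the largest index used on any axis.
            next_pos += max(frames, height, width)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return positions


ids = mrope_position_ids([("text", 2), ("vision", (2, 2, 2)), ("text", 1)])
```

In this example the two leading text tokens take positions 0 and 1, the 2×2×2 visual block spans indices 2 and 3 on each axis, and the trailing text token resumes at 4.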

🚀 Quickstart

Queue-based offline inference (Python)
import os
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"
video_path = "data/example_video.mp4"
prompt = "Describe the video."

max_new_tokens = 1024
temperature = 1.0
top_k = 50
top_p = 1.0
repetition_penalty = 1.0

video_fps = 1.0
video_minlen = 8
video_maxlen = 256


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


if not checkpoint:
    raise ValueError("Missing `checkpoint`.")
if not video_path:
    raise ValueError("Missing `video_path`.")
if not os.path.isfile(video_path):
    raise FileNotFoundError(f"Video not found: {video_path}")

model, processor = load_model(checkpoint)
new_queries: "queue.Queue[dict]" = queue.Queue()
output_text_queue: "queue.Queue[str]" = queue.Queue()

query = {
    "prompt": prompt,
    "images": [],
    "videos": [video_path],
    "media_kwargs": {
        "video_fps": video_fps,
        "video_minlen": video_minlen,
        "video_maxlen": video_maxlen,
    },
    "generate_kwargs": {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": repetition_penalty,
        "do_sample": False,
    },
}


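def drain_output():
    # Print streamed text until the end-of-round marker arrives.
    # Note: get() has no timeout, so this blocks if the worker never responds.
    while True:
        tok = output_text_queue.get()
        if tok == "<|round_end|>":
            break
        print(tok, end="", flush=True)


# Run generation in a background thread: it reads queries from
# `new_queries` and streams text pieces to `output_text_queue`.
worker = threading.Thread(
    target=model.offline_generate,
    args=(processor, new_queries, output_text_queue),
    kwargs={"vision_chunked_length": 64},
    daemon=True,
)
worker.start()

new_queries.put(query)
drain_output()

# Ask the worker to stop, then wait briefly for it to exit.
new_queries.put({"stop_offline_generate": True})
worker.join(timeout=5.0)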

For image-only usage, keep the same template and change:

  • replace video_path with image_path
  • validate image_path instead of video_path
  • set images to [image_path]
  • set videos to []
  • remove media_kwargs if you do not need video-specific controls
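
Applying the changes listed above, an image-only query might be built as follows. This is a minimal sketch: `build_image_query` is a hypothetical helper, not part of the model's API, and it assumes the same query keys as the queue example.

```python
import os


def build_image_query(image_path, prompt, generate_kwargs):
    """Build one image-only query dict in the shape used by the queue example."""
    if not image_path:
        raise ValueError("Missing `image_path`.")
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"Image not found: {image_path}")
    return {
        "prompt": prompt,
        "images": [image_path],  # the image replaces the video input
        "videos": [],            # no video for image-only usage
        # media_kwargs is omitted: no video-specific controls are needed
        "generate_kwargs": dict(generate_kwargs),
    }
```

The returned dict can be put on `new_queries` in place of the video query; the rest of the queue example stays the same.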

Batched offline inference (Python)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "path/to/checkpoint"

shared_generate_kwargs = {
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "do_sample": False,
}

shared_media_kwargs = {
    "video_fps": 1.0,
    "video_minlen": 8,
    "video_maxlen": 256,
}


def load_model(checkpoint: str):
    processor = AutoProcessor.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        frame_extract_num_threads=1,
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    return model, processor


model, processor = load_model(checkpoint)
queries = [
    {
        "prompt": "Describe sample A.",
        "images": [],
        "videos": ["data/sample_a.mp4"],
        "media_kwargs": dict(shared_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "prompt": "Describe sample B.",
        "images": [],
        "videos": ["data/sample_b.mp4"],
        "media_kwargs": dict(shared_media_kwargs),
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        session_states=None,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
session_states = result["session_states"]
followup_queries = [
    {
        "prompt": "Summarize sample A in one sentence.",
        "generate_kwargs": dict(shared_generate_kwargs),
    },
    {
        "prompt": "Restart sample B and answer again.",
        "reset_session": True,
        "generate_kwargs": dict(shared_generate_kwargs),
    },
]

with torch.no_grad():
    followup_result = model.offline_batch_generate(
        processor,
        followup_queries,
        session_states=session_states,
        vision_chunked_length=64,
    )
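
Because a batch call requires shared generate_kwargs and media_kwargs (see Limitations and Future Work), it can help to check the queries before calling offline_batch_generate. The helper below is a hypothetical pre-flight check, not part of the model's API. It assumes a query that omits a key is equivalent to one passing an empty dict, which this card does not confirm.

```python
def check_shared_kwargs(queries, keys=("generate_kwargs", "media_kwargs")):
    """Raise ValueError if queries in one batch disagree on any of `keys`.

    Assumption: a missing key is treated as an empty dict.
    """
    if not queries:
        return
    for key in keys:
        reference = queries[0].get(key, {})
        for index, query in enumerate(queries):
            if query.get(key, {}) != reference:
                raise ValueError(
                    f"query {index} has a different {key!r} than query 0"
                )


# Example with stand-in queries (no model needed).
shared = {"max_new_tokens": 256, "do_sample": False}
check_shared_kwargs([
    {"prompt": "A", "generate_kwargs": dict(shared)},
    {"prompt": "B", "generate_kwargs": dict(shared)},
])
```

Calling `check_shared_kwargs(queries)` just before each batch call turns a silent mismatch into an early, readable error.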

Intended Use

  • offline image understanding
  • offline video understanding
  • multimodal prompt experiments for release validation
  • checkpoint-level inference integration and debugging

Requirements

Core validated inference dependencies:

  • python==3.12.13
  • torch==2.8.0+cu128
  • torchvision==0.23.0+cu128
  • transformers==4.57.1
  • accelerate==1.12.0
  • flash_attn==2.8.1
  • torchcodec==0.7.0
  • numpy==2.4.3
  • pillow==12.1.1
  • joblib==1.5.2
  • einops==0.8.2

Installation commands:

conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt

Validated setup notes:

  • CUDA runtime used for validation: 12.8
  • Inference loading uses trust_remote_code=True and attn_implementation="flash_attention_2"
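
After installation, the pinned versions above can be compared against the active environment. This is a small sketch using the standard-library `importlib.metadata`; the distribution names are assumed to match the names in the list, and only a subset of the pins is shown. Local version tags such as `+cu128` are compared as exact strings.

```python
from importlib import metadata

# Subset of the validated pins listed above.
PINNED = {
    "torch": "2.8.0+cu128",
    "torchvision": "0.23.0+cu128",
    "transformers": "4.57.1",
    "accelerate": "1.12.0",
    "flash_attn": "2.8.1",
    "torchcodec": "0.7.0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not met.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed package matches its validated version; other versions may still work but have not been validated here.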

Limitations and Future Work

  • realtime usage is not documented here
  • benchmark results, metrics, and training details are not yet published
  • some sections are intentionally placeholders until release information is finalized
  • batch calls currently require shared generate_kwargs and shared media_kwargs within one call
  • batch streaming and batch cancel / stop protocol are not part of offline_batch_generate(...)
  • the queue example is intentionally minimal and does not include production-grade timeout or worker error handling
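
For the last point, one way to harden the output loop is to bound each wait and check that the worker is still alive. The sketch below is an assumption-laden illustration, not the project's recommended pattern: it reuses the `<|round_end|>` marker from the queue example and substitutes a stand-in producer thread so that it runs without a model.

```python
import queue
import threading

END_OF_ROUND = "<|round_end|>"


def drain_output(output_queue, worker=None, timeout_s=30.0):
    """Collect streamed text until the end-of-round marker arrives.

    Raises RuntimeError if the worker thread has exited without sending
    the marker, and TimeoutError if no text arrives within timeout_s.
    """
    pieces = []
    while True:
        try:
            tok = output_queue.get(timeout=timeout_s)
        except queue.Empty:
            if worker is not None and not worker.is_alive():
                raise RuntimeError("worker exited before finishing") from None
            raise TimeoutError(f"no output for {timeout_s} s") from None
        if tok == END_OF_ROUND:
            return "".join(pieces)
        pieces.append(tok)


def fake_worker(out):
    """Stand-in for model.offline_generate: emits text, then the marker."""
    for tok in ["Hello", ", ", "world"]:
        out.put(tok)
    out.put(END_OF_ROUND)


out_q = queue.Queue()
thread = threading.Thread(target=fake_worker, args=(out_q,), daemon=True)
thread.start()
text = drain_output(out_q, worker=thread, timeout_s=2.0)
thread.join(timeout=2.0)
```

The timeout applies to each individual wait rather than to the whole round, so a long but steadily streaming generation is not cut off.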

Citation

@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/fnlp-vision/MOSS-VL}},
  note          = {GitHub repository}
}