OpenMOSS-Team/MOSS-VL-Base-0408

Video-Text-to-Text

image-feature-extraction

Video-Understanding

Image-Understanding

vision-language


CCCCyx committed on Apr 8

Commit 3411f0a · verified · 1 Parent(s): 9b57d0f

Update README.md

Files changed (1)

README.md +2 -2


@@ -40,8 +40,8 @@ Specifically, the pretraining pipeline is structured into the following four pro
 ### ✨ Highlights
-- 📐 **Native Dynamic Resolution** — MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions without forced resizing, padding, or cropping. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formats—from high-resolution photographs and dense document scans to ultra-wide screenshots.
-- 🎞️ **Native Interleaved Image & Video Inputs** — The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing or separate routing logic.
+- 📐 **Native Dynamic Resolution** MOSS-VL-Base-0408 natively processes images and video frames at their original aspect ratios and resolutions. By preserving the raw spatial layout, it faithfully captures fine visual details across diverse formats—from high-resolution photographs and dense document scans to ultra-wide screenshots.
+- 🎞️ **Native Interleaved Image & Video Inputs** The model accepts arbitrary combinations of images and videos within a single sequence. Through a unified end-to-end pipeline, it seamlessly handles complex mixed-modality prompts, multi-image comparisons, and interleaved visual narratives without requiring modality-specific pre-processing or separate routing logic.
 ## 🏗 Model Architecture