OpenMOSS-Team/MOSS-VL-Base-0408


CCCCyx committed on Apr 8

Commit a5ceaa6 (verified) · 1 parent: 106e9fe

Update README.md

Files changed (1)

README.md +4 -4

@@ -31,10 +31,10 @@ MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of th
 Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation
-1. Stage 1: Vision-language alignment
-2. Stage 2: Large-scale multimodal pretraining
-3. Stage 3: High-quality multimodal pretraining
-4. Stage 4: Annealing and long-context extension
+- Stage 1: Vision-language alignment
+- Stage 2: Large-scale multimodal pretraining
+- Stage 3: High-quality multimodal pretraining
+- Stage 4: Annealing and long-context extension
 ### ✨ Highlights