Video-Text-to-Text
Transformers
Safetensors
English
moss_vl
image-feature-extraction
Base
Video-Understanding
Image-Understanding
MOSS-VL
OpenMOSS
multimodal
video
vision-language
custom_code
Instructions to use OpenMOSS-Team/MOSS-VL-Base-0408 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use OpenMOSS-Team/MOSS-VL-Base-0408 with Transformers:
    # Load model directly
    from transformers import AutoModel
    model = AutoModel.from_pretrained("OpenMOSS-Team/MOSS-VL-Base-0408", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
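Because the repository is tagged `custom_code`, loading it with `trust_remote_code=True` runs Python code shipped in the model repository. A minimal sketch of wrapping the Transformers loading call so that a specific revision can be pinned is given here. The helper names `build_load_kwargs` and `load_model` are hypothetical and not part of the model card; `revision` is a standard `from_pretrained` argument, and no particular commit hash is implied.

```python
MODEL_ID = "OpenMOSS-Team/MOSS-VL-Base-0408"


def build_load_kwargs(dtype="auto", revision=None):
    """Collect the keyword arguments for AutoModel.from_pretrained.

    Hypothetical helper for illustration only.
    """
    # trust_remote_code=True lets Transformers execute the modeling code
    # stored in the model repository (required for custom_code models).
    # Note: older Transformers releases spell `dtype` as `torch_dtype`.
    kwargs = {"trust_remote_code": True, "dtype": dtype}
    if revision is not None:
        # Pinning a branch, tag, or commit hash keeps the executed remote
        # code fixed across runs.
        kwargs["revision"] = revision
    return kwargs


def load_model(model_id=MODEL_ID, **overrides):
    """Load the checkpoint; downloads weights on first use."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModel

    return AutoModel.from_pretrained(model_id, **build_load_kwargs(**overrides))
```

Calling `load_model()` mirrors the one-line call in the card, while `load_model(revision="<commit-hash>")` would pin the remote code, where `<commit-hash>` is a placeholder to be filled in by the reader.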
Update README.md
README.md
CHANGED
@@ -31,10 +31,10 @@ MOSS-VL-Base-0408 is the foundation checkpoint of the MOSS-VL series, part of th

 Built through four stages of multimodal pretraining only, this checkpoint serves as a high-capacity offline multimodal foundation model. It provides strong general-purpose visual-linguistic representations across image and video inputs, and is intended primarily as the base model for downstream supervised fine-tuning, alignment, and domain adaptation

-
-
-
-
+- Stage 1: Vision-language alignment
+- Stage 2: Large-scale multimodal pretraining
+- Stage 3: High-quality multimodal pretraining
+- Stage 4: Annealing and long-context extension

 ### ✨ Highlights
