tsinghua-ee/video-SALMONN-2

Video-Text-to-Text

text-generation

text-generation-inference

video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Official model release of video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Results

Downloads last month: 11

Safetensors

Model size: 9B params

Tensor type: I64 · BF16

Model tree for tsinghua-ee/video-SALMONN-2

Base model

Finetuned

(76)

this model

Datasets used to train tsinghua-ee/video-SALMONN-2

Collection including tsinghua-ee/video-SALMONN-2

video-SALMONN 2

video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions. • 11 items • Updated Mar 21

Paper for tsinghua-ee/video-SALMONN-2

video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Paper • 2506.15220 • Published Jun 18, 2025