
Collections

Collections including paper arxiv:2504.07491

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 192
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 18
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 52
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 43

Reinforcement Pre-Training

Paper • 2506.08007 • Published Jun 9, 2025 • 265
A Survey on Latent Reasoning

Paper • 2507.06203 • Published Jul 8, 2025 • 95
Language Models are Few-Shot Learners

Paper • 2005.14165 • Published May 28, 2020 • 20
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Paper • 1910.10683 • Published Oct 23, 2019 • 19

Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10, 2025 • 142

academic papers

One-Minute Video Generation with Test-Time Training

Paper • 2504.05298 • Published Apr 7, 2025 • 110
Slow-Fast Architecture for Video Multi-Modal Large Language Models

Paper • 2504.01328 • Published Apr 2, 2025 • 7
Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10, 2025 • 142

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7, 2025 • 209
Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10, 2025 • 142

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 55
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 91
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 35

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Paper • 2506.05176 • Published Jun 5, 2025 • 83
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Paper • 2506.04207 • Published Jun 4, 2025 • 48
MiMo-VL Technical Report

Paper • 2506.03569 • Published Jun 4, 2025 • 81
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Paper • 2506.03147 • Published Jun 3, 2025 • 59

Vision Language Models: 2025 Update

This collection includes all the models, datasets and Spaces mentioned in the blog Vision Language Models: 2025 Update

Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30, 2025 • 731k • 1.91k
Qwen2.5 Omni 7B Demo

Space • Running • 372
Chat with text, audio, images, and video, get spoken replies
Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26, 2025 • 173
openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Oct 5, 2025 • 429k • 1.29k

Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10, 2025 • 142

Moonshot's efficient MoE VLMs, exceptional on agent, long-context, and thinking

Chat with Kimi-VL-A3B-Thinking-2506

Space • Build error • 199
Chat with Kimi-VL: respond to text, images, video, PDFs
moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated Jan 30 • 6.64k • 362
moonshotai/Kimi-VL-A3B-Instruct

Image-Text-to-Text • 16B • Updated Jan 30 • 295k • 268
moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Jan 30 • 139k • 448

