---
license: apache-2.0
library_name: transformers
pipeline_tag: video-text-to-text
---

# OmniVideo-30B (Qwen3-Omni)

[![Project Page](https://img.shields.io/badge/Project-Page-orange)](https://yzlmhzz.github.io/OmniVideo-100K/)
[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/2606.14702)
[![Github](https://img.shields.io/badge/Github-OmniVideo--100K-blue)](https://github.com/MiG-NJU/OmniVideo-100K)
[![Dataset](https://img.shields.io/badge/Dataset-OmniVideo--100K-green)](https://huggingface.co/datasets/MiG-NJU/OmniVideo-100K)

This is the fine-tuned **OmniVideo-30B** model, initialized from the official **Qwen3-Omni-30B-A3B-Instruct** and trained on the **[OmniVideo-100K](https://huggingface.co/datasets/MiG-NJU/OmniVideo-100K)** instruction-tuning dataset introduced in our paper: *"[OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains](https://arxiv.org/abs/2606.14702)"*.

## 🚀 Performance

| Models | OmniVideo-Test | Daily-Omni | OmniVideoBench | JointAVBench | FutureOmni | Video-MME (short) | Video-MME-v2 |
|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 49.70 | 74.27 | 43.84 | 63.17 | 53.44 | 82.00 | 14.31 |
| OmniVideo-30B (Qwen3-Omni) | 63.56 (+13.86) | 76.61 (+2.34) | 44.81 (+0.97) | 66.37 (+3.20) | 57.60 (+4.16) | 83.56 (+1.56) | 15.33 (+1.02) |

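
The gains shown next to the OmniVideo-30B scores are simple differences against the Qwen3-Omni-30B-A3B-Instruct baseline. As a minimal sketch for reproducing them, the snippet below recomputes each difference from the scores reported above. The `scores` dictionary and the `gains` helper are illustrative names introduced here, not part of the official repository.

```python
# Scores copied from the performance table: (baseline, fine-tuned) per benchmark.
# `scores` and `gains` are hypothetical names used only for this illustration.
scores = {
    "OmniVideo-Test": (49.70, 63.56),
    "Daily-Omni": (74.27, 76.61),
    "OmniVideoBench": (43.84, 44.81),
    "JointAVBench": (63.17, 66.37),
    "FutureOmni": (53.44, 57.60),
    "Video-MME (short)": (82.00, 83.56),
    "Video-MME-v2": (14.31, 15.33),
}


def gains(table):
    """Return the fine-tuned minus baseline score for each benchmark, rounded to 2 decimals."""
    return {name: round(tuned - base, 2) for name, (base, tuned) in table.items()}


deltas = gains(scores)
for name, delta in deltas.items():
    # Print a signed gain per benchmark, e.g. "+2.34".
    print(f"{name:18s} {delta:+.2f}")
```
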
## ⚙️ Training and Evaluation

The model was trained with full-parameter fine-tuning using `LLaMA-Factory`. For detailed training and evaluation instructions, please refer to the official [OmniVideo-100K GitHub repository](https://github.com/MiG-NJU/OmniVideo-100K).

## 📑 Citation

If you find this work useful, please cite our paper:

```bibtex
@article{cai2026omnivideo100k,
  title={OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains},
  author={Cai, Xinyue and Fu, Chaoyou and Zhang, Yi-Fan and He, Ran and Shan, Caifeng},
  journal={arXiv preprint arXiv:2606.14702},
  year={2026}
}
```