---
license: apache-2.0
library_name: transformers
pipeline_tag: video-text-to-text
---

# OmniVideo-30B (Qwen3-Omni)

[![Project Page](https://img.shields.io/badge/Project-Page-orange)](https://yzlmhzz.github.io/OmniVideo-100K/)
[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/2606.14702)
[![Github](https://img.shields.io/badge/Github-OmniVideo--100K-blue)](https://github.com/MiG-NJU/OmniVideo-100K)
[![Dataset](https://img.shields.io/badge/Dataset-OmniVideo--100K-green)](https://huggingface.co/datasets/MiG-NJU/OmniVideo-100K)

This is the fine-tuned **OmniVideo-30B** model, initialized from the official **Qwen3-Omni-30B-A3B-Instruct** and trained on the **[OmniVideo-100K](https://huggingface.co/datasets/MiG-NJU/OmniVideo-100K)** instruction-tuning dataset introduced in our paper: *"[OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains](https://arxiv.org/abs/2606.14702)"*.

## 🚀 Performance

| Models | OmniVideo-Test | Daily-Omni | OmniVideoBench | JointAVBench | FutureOmni | Video-MME (short) | Video-MME-v2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | 49.70 | 74.27 | 43.84 | 63.17 | 53.44 | 82.00 | 14.31 |
| OmniVideo-30B (Qwen3-Omni) | 63.56 (+13.86) | 76.61 (+2.34) | 44.81 (+0.97) | 66.37 (+3.20) | 57.60 (+4.16) | 83.56 (+1.56) | 15.33 (+1.02) |
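
The gains listed next to the OmniVideo-30B scores are differences against the Qwen3-Omni-30B-A3B-Instruct baseline on the same benchmark. As a quick sanity check, the sketch below recomputes them from the scores in the table. The variable and function names (`BENCHMARKS`, `BASELINE`, `FINETUNED`, `score_deltas`) are illustrative only and are not part of the released code.

```python
# Scores copied from the performance table, in column order.
BENCHMARKS = [
    "OmniVideo-Test", "Daily-Omni", "OmniVideoBench", "JointAVBench",
    "FutureOmni", "Video-MME (short)", "Video-MME-v2",
]
BASELINE = [49.70, 74.27, 43.84, 63.17, 53.44, 82.00, 14.31]   # Qwen3-Omni-30B-A3B-Instruct
FINETUNED = [63.56, 76.61, 44.81, 66.37, 57.60, 83.56, 15.33]  # OmniVideo-30B


def score_deltas(baseline, finetuned):
    """Return the per-benchmark gain (fine-tuned minus baseline), rounded to 2 decimals."""
    return [round(f - b, 2) for b, f in zip(baseline, finetuned)]


deltas = score_deltas(BASELINE, FINETUNED)

# Print one line per benchmark with an explicit sign on the gain.
for name, delta in zip(BENCHMARKS, deltas):
    print(f"{name:<18} {delta:+.2f}")
```

Every recomputed gain matches the value reported in the table, with the largest improvement on OmniVideo-Test.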

## ⚙️ Training and Evaluation

The model was fine-tuned with full-parameter training using `LLaMA-Factory`. For detailed training and evaluation instructions, please refer to the official [OmniVideo-100K GitHub repository](https://github.com/MiG-NJU/OmniVideo-100K).

## 📑 Citation

If you find this work useful, please cite our paper:

```bibtex
@article{cai2026omnivideo100k,
  title={OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains},
  author={Cai, Xinyue and Fu, Chaoyou and Zhang, Yi-Fan and He, Ran and Shan, Caifeng},
  journal={arXiv preprint arXiv:2606.14702},
  year={2026}
}
```