OmniVideo-30B (Qwen3-Omni)


This is the fine-tuned OmniVideo-30B model, initialized from the official Qwen3-Omni-30B-A3B-Instruct and trained on the OmniVideo-100K instruction-tuning dataset introduced in our paper: "OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains".

🚀 Performance

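| Models | OmniVideo-Test | Daily-Omni | OmniVideoBench | JointAVBench | FutureOmni | Video-MME (short) | Video-MME-v2 |
|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 49.70 | 74.27 | 43.84 | 63.17 | 53.44 | 82.00 | 14.31 |
| OmniVideo-30B (Qwen3-Omni) | 63.56 (+13.86) | 76.61 (+2.34) | 44.81 (+0.97) | 66.37 (+3.20) | 57.60 (+4.16) | 83.56 (+1.56) | 15.33 (+1.02) |

Values in parentheses are absolute gains over the Qwen3-Omni-30B-A3B-Instruct baseline.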
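
The reported gains can be re-derived from the two score rows. The snippet below is a minimal sketch for checking that arithmetic; the names `SCORES` and `absolute_gains` are illustrative and are not part of the OmniVideo-100K codebase.

```python
# Scores copied from the table above as (baseline, fine-tuned) pairs.
# Baseline: Qwen3-Omni-30B-A3B-Instruct; fine-tuned: OmniVideo-30B (Qwen3-Omni).
SCORES = {
    "OmniVideo-Test": (49.70, 63.56),
    "Daily-Omni": (74.27, 76.61),
    "OmniVideoBench": (43.84, 44.81),
    "JointAVBench": (63.17, 66.37),
    "FutureOmni": (53.44, 57.60),
    "Video-MME (short)": (82.00, 83.56),
    "Video-MME-v2": (14.31, 15.33),
}


def absolute_gains(scores):
    """Return fine-tuned minus baseline per benchmark, rounded to 2 decimals."""
    return {
        name: round(tuned - base, 2)
        for name, (base, tuned) in scores.items()
    }


if __name__ == "__main__":
    for name, gain in absolute_gains(SCORES).items():
        print(f"{name}: +{gain:.2f}")
```

Rounding to two decimals avoids floating-point noise when the computed differences are compared against the published deltas.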

โš™๏ธ Training and Evaluation

The model was trained with full-parameter fine-tuning using LLaMA-Factory.

For detailed training and evaluation instructions, please refer to the official OmniVideo-100K GitHub repository.

📑 Citation

If you find this work useful, please cite our paper:

@article{cai2026omnivideo100k,
  title={OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains},
  author={Cai, Xinyue and Fu, Chaoyou and Zhang, Yi-Fan and He, Ran and Shan, Caifeng},
  journal={arXiv preprint arXiv:2606.14702}, 
  year={2026}
}