OmniVideo-30B (Qwen3-Omni)

This is the fine-tuned OmniVideo-30B model, initialized from the official Qwen3-Omni-30B-A3B-Instruct and trained on the OmniVideo-100K instruction-tuning dataset introduced in our paper: "OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains".

🚀 Performance

Model	OmniVideo-Test	Daily-Omni	OmniVideoBench	JointAVBench	FutureOmni	Video-MME_short	Video-MME-v2
Qwen3-Omni-30B-A3B-Instruct	49.70	74.27	43.84	63.17	53.44	82.00	14.31
OmniVideo-30B (Qwen3-Omni)	63.56 (+13.86)	76.61 (+2.34)	44.81 (+0.97)	66.37 (+3.20)	57.60 (+4.16)	83.56 (+1.56)	15.33 (+1.02)
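
The gain shown next to each OmniVideo-30B score is the difference from the Qwen3-Omni-30B-A3B-Instruct baseline on the same benchmark. A minimal sketch that recomputes those differences from the scores in the table; the names `BENCHMARKS`, `BASE`, `TUNED`, and `gains` are illustrative and not part of any released code:

```python
# Scores copied from the table above.
BENCHMARKS = [
    "OmniVideo-Test", "Daily-Omni", "OmniVideoBench", "JointAVBench",
    "FutureOmni", "Video-MME_short", "Video-MME-v2",
]
BASE = [49.70, 74.27, 43.84, 63.17, 53.44, 82.00, 14.31]   # Qwen3-Omni-30B-A3B-Instruct
TUNED = [63.56, 76.61, 44.81, 66.37, 57.60, 83.56, 15.33]  # OmniVideo-30B (Qwen3-Omni)


def gains(base, tuned):
    """Return the per-benchmark difference (tuned - base), rounded to 2 decimals."""
    return [round(t - b, 2) for b, t in zip(base, tuned)]


DELTAS = gains(BASE, TUNED)

# Print one line per benchmark with an explicit sign on the gain.
for name, delta in zip(BENCHMARKS, DELTAS):
    print(f"{name}: {delta:+.2f}")
```

The largest gain is on OmniVideo-Test, the benchmark that belongs to the same dataset as the training data; the other six benchmarks improve by between 0.97 and 4.16 points.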

⚙️ Training and Evaluation

The model was trained with full-parameter fine-tuning using LLaMA-Factory.

For detailed training and evaluation instructions, please refer to the official OmniVideo-100K GitHub repository.

📑 Citation

If you find this work useful, please cite our paper:

@article{cai2026omnivideo100k,
  title={OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains},
  author={Cai, Xinyue and Fu, Chaoyou and Zhang, Yi-Fan and He, Ran and Shan, Caifeng},
  journal={arXiv preprint arXiv:2606.14702},
  year={2026}
}

Safetensors
Model size: 35B params
Tensor type: BF16
