OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains
Paper • arXiv:2606.14702
How to use MiG-NJU/OmniVideo-30B_Qwen3-Omni with Transformers:
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM
processor = AutoProcessor.from_pretrained("MiG-NJU/OmniVideo-30B_Qwen3-Omni")
model = AutoModelForMultimodalLM.from_pretrained("MiG-NJU/OmniVideo-30B_Qwen3-Omni")

This is the fine-tuned OmniVideo-30B model. It is initialized from the official Qwen3-Omni-30B-A3B-Instruct and trained on the OmniVideo-100K instruction-tuning dataset introduced in our paper, "OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains".
| Models | OmniVideo-Test | Daily-Omni | OmniVideoBench | JointAVBench | FutureOmni | Video-MME (short) | Video-MME-v2 |
|---|---|---|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | 49.70 | 74.27 | 43.84 | 63.17 | 53.44 | 82.00 | 14.31 |
| OmniVideo-30B (Qwen3-Omni) | 63.56 (+13.86) | 76.61 (+2.34) | 44.81 (+0.97) | 66.37 (+3.20) | 57.60 (+4.16) | 83.56 (+1.56) | 15.33 (+1.02) |
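
The gain reported next to each OmniVideo-30B score is the difference from the Qwen3-Omni-30B-A3B-Instruct baseline on the same benchmark. The short sketch below recomputes those differences from the table values. The names `BENCHMARKS`, `BASELINE`, `FINETUNED`, and `score_deltas` are illustrative and not part of the released code, and the scores are copied from the table by hand.

```python
from decimal import Decimal

# Benchmark names and scores copied from the table above (illustrative constants).
BENCHMARKS = [
    "OmniVideo-Test", "Daily-Omni", "OmniVideoBench", "JointAVBench",
    "FutureOmni", "Video-MME (short)", "Video-MME-v2",
]
BASELINE = ["49.70", "74.27", "43.84", "63.17", "53.44", "82.00", "14.31"]
FINETUNED = ["63.56", "76.61", "44.81", "66.37", "57.60", "83.56", "15.33"]


def score_deltas(baseline, finetuned):
    """Return {benchmark: fine-tuned score minus baseline score}.

    Decimal is used instead of float so the two-decimal scores subtract exactly.
    """
    return {
        name: Decimal(f) - Decimal(b)
        for name, b, f in zip(BENCHMARKS, baseline, finetuned)
    }


deltas = score_deltas(BASELINE, FINETUNED)
for name, delta in deltas.items():
    # The "+" format flag prints an explicit sign for each difference.
    print(f"{name}: {delta:+}")
```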
The model was trained with full-parameter fine-tuning using LLaMA-Factory.
For detailed training and evaluation instructions, please refer to the official OmniVideo-100K GitHub repository.
If you find this work useful, please cite our paper:
@article{cai2026omnivideo100k,
title={OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains},
author={Cai, Xinyue and Fu, Chaoyou and Zhang, Yi-Fan and He, Ran and Shan, Caifeng},
journal={arXiv preprint arXiv:2606.14702},
year={2026}
}