arxiv:2606.08572

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Published on Jun 7

· Submitted by

jiahaowang on Jun 9

NJU-LINK Lab

Upvote

Authors:

Jiahao Wang ,

Abstract

OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

View arXiv page View PDF Project page GitHub 10 Add to collection

Community

wang-jiahao

Paper author Paper submitter about 11 hours ago

OmniCap-IF targets instruction-following in omni-modal video captioning, where models must not only understand visual and audio streams, but also obey complex user-specified structural, stylistic, temporal, visual, audio, and audio-visual constraints. We introduce the OmniCap-IF benchmark for fine-grained checklist-based evaluation, construct the OmniCap-IF-54K instruction-tuning dataset, and train the OmniCaptioner-IF model family to improve controllable omni-video captioning.