arxiv:2606.08572

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Published on Jun 7 · Submitted by jiahaowang on Jun 9

Abstract

OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning.

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

Community


OmniCap-IF targets instruction-following in omni-modal video captioning, where models must not only understand visual and audio streams, but also obey complex user-specified structural, stylistic, temporal, visual, audio, and audio-visual constraints. We introduce the OmniCap-IF benchmark for fine-grained checklist-based evaluation, construct the OmniCap-IF-54K instruction-tuning dataset, and train the OmniCaptioner-IF model family to improve controllable omni-video captioning.
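As a rough illustration of what checklist-based scoring along the two dimensions named in the abstract (format correctness and content correctness) could look like, here is a minimal Python sketch. It is not the benchmark's actual evaluation code: the `ChecklistItem` class, the `score_caption` function, the example constraints, and the per-dimension pass rate are all hypothetical choices made for illustration. In particular, the keyword checks below are a toy stand-in for whatever content judgment the benchmark really uses.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ChecklistItem:
    """One hypothetical checklist entry for a caption."""
    description: str
    dimension: str  # either "format" or "content"
    check: Callable[[str], bool]


def score_caption(caption: str, checklist: List[ChecklistItem]) -> Dict[str, Optional[float]]:
    """Return the fraction of passed items per dimension (None if a dimension has no items)."""
    results: Dict[str, List[bool]] = {"format": [], "content": []}
    for item in checklist:
        results[item.dimension].append(item.check(caption))
    return {
        dim: (sum(passed) / len(passed) if passed else None)
        for dim, passed in results.items()
    }


# Illustrative constraints only; none of these come from the paper.
checklist = [
    ChecklistItem(
        "at most two sentences",
        "format",
        lambda c: len(re.findall(r"[.!?](?:\s|$)", c)) <= 2,
    ),
    ChecklistItem("written entirely in lowercase", "format", lambda c: c == c.lower()),
    ChecklistItem("mentions the barking sound", "content", lambda c: "bark" in c.lower()),
    ChecklistItem("mentions the laughter", "content", lambda c: "laugh" in c.lower()),
]

caption = "A dog barks twice. A man laughs off-screen."
scores = score_caption(caption, checklist)
print(scores)
```

Keeping the two dimensions as separate scores, rather than collapsing them into one number, is what would make a format-content tradeoff of the kind described in the abstract visible in the first place.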

[Figure: overview_framework]


Get this paper in your agent:

hf papers read 2606.08572
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
