Keywords: Composed Video Retrieval; Multimodal Benchmark; Audio-Visual Queries
Abstract: Composed video retrieval requires retrieving a target video given a source video and a textual modification instruction, demanding fine-grained reasoning over multimodal transformations. Existing benchmarks primarily emphasize vision–text alignment and overlook the rich semantic signals in audio, such as speech, music, and environmental sounds, which are often decisive for understanding. To address this limitation, we introduce OmniCVR, a large-scale benchmark for omni-composed video retrieval that integrates vision, audio, and text as first-class modalities. OmniCVR is built through a scalable pipeline that combines segmentation, omni-modal annotation, and dual validation by both large language models and human experts. The benchmark features vision-centric, audio-centric, and integrated queries, with integrated queries forming the majority to better reflect real-world multimodal scenarios. We further propose AudioVLM2Vec, an audio-aware extension of VLM2Vec that incorporates explicit audio semantics, achieving state-of-the-art performance and exposing fundamental gaps in current multimodal retrieval systems.
Primary Area: datasets and benchmarks
Submission Number: 18858