Keywords: Composed Video Retrieval; Multimodal Benchmark; Audio-Visual Queries
Abstract: Composed video retrieval requires retrieving a target video given a source video and a textual modification instruction, demanding fine-grained reasoning over multimodal transformations. Existing benchmarks primarily emphasize vision–text alignment and overlook the rich semantic signals in audio, such as speech, music, and environmental sounds, which are often decisive for understanding. To address this limitation, we introduce OmniCVR, a large-scale benchmark for omni-composed video retrieval that integrates vision, audio, and text as first-class modalities. OmniCVR is built through a scalable pipeline that combines segmentation, omni-modal annotation, and dual validation by both large language models and human experts. The benchmark features vision-centric, audio-centric, and integrated queries, with integrated queries forming the majority to better reflect real-world multimodal scenarios. We further propose AudioVLM2Vec, an audio-aware extension of VLM2Vec that incorporates explicit audio semantics, achieving state-of-the-art performance and exposing fundamental gaps in current multimodal retrieval systems.
Primary Area: datasets and benchmarks
Submission Number: 18858