Captioning for Text-Video Retrieval via DualGroup-Direct Preference Optimization

ACL ARR 2025 May Submission2898 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: In text-video retrieval, auxiliary captions are often used to enhance video understanding, bridging the gap between the two modalities. Recently, owing to the remarkable multi-modal understanding capabilities of MLLMs, retrieval with MLLMs has emerged as a promising direction. However, we identify two key limitations: (1) retrieval models often fail to leverage auxiliary captions effectively, neglecting the semantic distinction between captions (contextual knowledge) and text queries (retrieval targets); and (2) auxiliary captions are typically not tailored for retrieval: they are evaluated with language-generation metrics such as BLEU, which misalign with retrieval objectives that require fine-grained discrimination. To address these challenges, we propose CaRe-DPO, a retrieval framework that integrates two key components. First, retrieval role-embeddings explicitly differentiate the roles of heterogeneous textual inputs, allowing the model to better utilize auxiliary captions during retrieval. Second, we present DualGroup-Direct Preference Optimization (DG-DPO), a novel caption-optimization strategy that directly supervises caption quality with retrieval relevance scores. Moreover, unlike traditional DPO, DG-DPO incorporates group-level preferences, enabling the model to learn a global retrieval ranking over video-caption pairs. Through extensive experiments, we show that CaRe-DPO significantly improves retrieval performance by effectively utilizing auxiliary knowledge while generating better captions for retrieval.
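To make the group-level idea concrete, the sketch below shows one plausible way a DPO-style objective could be extended from a single chosen/rejected pair to a group of candidate captions ranked by retrieval relevance: every ordered pair in the group whose retrieval scores disagree contributes a standard DPO pairwise term. This is a minimal illustration under our own assumptions, not the paper's actual DG-DPO formulation; the function name, the exhaustive pairwise decomposition, and the averaging over pairs are all hypothetical.

```python
import math


def dg_dpo_group_loss(policy_logps, ref_logps, retrieval_scores, beta=0.1):
    """Hypothetical group-level DPO-style loss (illustrative only).

    policy_logps[k]     -- log pi_theta(caption_k | video) under the trained model
    ref_logps[k]        -- log pi_ref(caption_k | video) under the frozen reference
    retrieval_scores[k] -- retrieval relevance score of caption_k (higher is better)

    For every ordered pair (i, j) with retrieval_scores[i] > retrieval_scores[j],
    add the standard DPO term -log sigmoid(beta * (r_i - r_j)), where
    r_k = log pi_theta(c_k) - log pi_ref(c_k) is the log-probability ratio.
    """
    ratios = [p - r for p, r in zip(policy_logps, ref_logps)]
    loss, n_pairs = 0.0, 0
    for i in range(len(ratios)):
        for j in range(len(ratios)):
            if retrieval_scores[i] > retrieval_scores[j]:
                margin = beta * (ratios[i] - ratios[j])
                # -log sigmoid(margin), written out with math.exp
                loss += math.log(1.0 + math.exp(-margin))
                n_pairs += 1
    return loss / max(n_pairs, 1)
```

When the policy assigns higher probability (relative to the reference) to captions with higher retrieval scores, every pairwise margin is positive and the loss drops below -log sigmoid(0) = log 2; when the ranking is inverted, the loss rises above it, which is the behavior a retrieval-supervised preference objective would aim for.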
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Text-Video Retrieval, Direct Preference Optimization
Languages Studied: English
Submission Number: 2898