Spot the Key, Recover the Rest: Dual-Path&View Representation Learning for Text-Video Retrieval

20 Sept 2025 (modified: 23 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: multi-modal representation learning, text-video retrieval, feature enhancement and interaction.
Abstract: In recent years, CLIP-based text-video retrieval methods have developed rapidly, with research primarily focusing on exploiting diverse textual and visual cues to achieve effective feature interaction. However, an accurate retrieval model requires not only strong feature enhancement techniques, e.g., text expansion, but also coarse-to-fine granularity interaction strategies, e.g., word-patch alignment. To address these two challenges, we propose a novel text-video retrieval framework, SKRR (Spot the Key, Recover the Rest), which consists of a Dual-Path Feature Partitioning (DPFP) module for feature enhancement and a Dual-View Feature Interaction (DVFI) module for feature interaction. DPFP simulates the human macro-level cognitive perspective by partitioning visual features into two categories according to their relevance to the text query and supplementing the less relevant features with additional textual cues. DVFI simulates the human macro-to-micro alignment strategy, effectively focusing on local visual features while comprehensively modeling fine-grained interactions. DPFP and DVFI work synergistically, jointly promoting cross-modal feature enhancement and interaction. We evaluate SKRR on five benchmark datasets, including MSRVTT (50.5%), achieving state-of-the-art retrieval performance.
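For readers skimming the abstract, the sketch below illustrates the two ideas in toy PyTorch: a text-conditioned split of frame features with textual supplementation of the less relevant ones (the DPFP idea), and a common word-patch similarity aggregation of the kind DVFI's micro-level view targets. Every name, the top-k split, and the fusion rule are assumptions for illustration; the paper's actual modules are not shown on this page.

```python
# A minimal sketch of the two ideas the abstract describes, assuming
# CLIP-style embeddings. All function names, the top-k split, and the
# residual fusion are hypothetical illustrations, not the paper's code.
import torch
import torch.nn.functional as F

def partition_and_recover(frame_feats, text_feat, key_ratio=0.5):
    """DPFP-style split: frame_feats (N, D) per-frame visual features,
    text_feat (D,) query embedding. Frames are partitioned into
    text-relevant ("key") and less relevant ("rest") sets by cosine
    similarity; the rest are supplemented with the textual cue."""
    sims = F.cosine_similarity(frame_feats, text_feat.unsqueeze(0), dim=-1)  # (N,)
    k = max(1, int(key_ratio * frame_feats.size(0)))
    key_idx = sims.topk(k).indices
    rest_mask = torch.ones(frame_feats.size(0), dtype=torch.bool)
    rest_mask[key_idx] = False
    # "Recover the rest": a simple residual fusion of the text embedding
    # stands in for the paper's enhancement step.
    key_feats = frame_feats[key_idx]
    rest_feats = frame_feats[rest_mask] + text_feat.unsqueeze(0)
    return key_feats, rest_feats

def word_patch_score(word_feats, patch_feats):
    """A standard fine-grained word-patch aggregation (each word matched
    to its best patch, averaged over words), shown only to illustrate the
    kind of micro-level interaction DVFI targets."""
    w = F.normalize(word_feats, dim=-1)
    p = F.normalize(patch_feats, dim=-1)
    sim = w @ p.t()                       # (W, P) cosine similarities
    return sim.max(dim=-1).values.mean()  # scalar retrieval score

# Toy usage with CLIP-sized (512-d) embeddings.
frames, query = torch.randn(12, 512), torch.randn(512)
key, rest = partition_and_recover(frames, query)
score = word_patch_score(torch.randn(8, 512), torch.randn(49, 512))
```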
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23185