Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · License: CC BY 4.0
Keywords: cross-modal retrieval, text-video retrieval, video understanding
TL;DR: We introduce GARE, a gap-aware retrieval framework that learns pair-specific increments to alleviate optimization tension and false-negative noise in cross-modal alignment, achieving better uniformity and semantic structure.
Abstract: Recent progress in text–video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which forces anchor representations into in-place optimization (optimization tension) and limits their alignment capacity. Moreover, noisy hard negatives further distort anchor semantics. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. We further regularize $\Delta$ through a variational information bottleneck with relaxed compression, improving stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation.
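To make the derivation the abstract alludes to concrete: under a trust-region constraint $\|\Delta\|\le\epsilon$, the first-order Taylor expansion $\mathcal{L}(s+\Delta)\approx\mathcal{L}(s)+\langle\nabla_s\mathcal{L},\Delta\rangle$ is minimized by the steepest-descent step $\Delta^{*}=-\epsilon\,\nabla_s\mathcal{L}/\|\nabla_s\mathcal{L}\|$, which is why the increment tracks a locally consistent descent direction. The sketch below is one plausible instantiation in PyTorch, reconstructed from the abstract alone rather than the authors' code: a gap-conditioned module predicts $\Delta_{ij}$ through a variational information bottleneck, a trust-region cap bounds its norm, and InfoNCE is computed over the incremented pairs. The module name `GapIncrement`, the application of $\Delta_{ij}$ to the text side, and the hyperparameters `eps`, `tau`, and `beta` are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): a gap-conditioned increment
# module with a variational information bottleneck, assuming the increment is
# added to the text anchor before computing InfoNCE similarities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GapIncrement(nn.Module):
    """Predicts a pair-specific increment Delta_ij from the semantic gap t_i - v_j."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, dim)       # mean of the bottleneck posterior
        self.logvar = nn.Linear(hidden, dim)   # log-variance of the posterior

    def forward(self, gap: torch.Tensor):
        h = self.encoder(gap)
        mu, logvar = self.mu(h), self.logvar(h)
        delta = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL(q(delta | gap) || N(0, I)): the bottleneck's compression term
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return delta, kl

def gare_style_loss(t, v, module, tau=0.05, eps=0.1, beta=1e-3):
    """InfoNCE over incremented text-video pairs; eps caps ||Delta|| (trust region)."""
    B, _ = t.shape
    gap = t.unsqueeze(1) - v.unsqueeze(0)                 # (B, B, D) pairwise semantic gaps
    delta, kl = module(gap)
    # Trust-region-style cap: rescale any increment whose norm exceeds eps.
    norm = delta.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    delta = delta * torch.clamp(eps / norm, max=1.0)
    t_adj = F.normalize(t.unsqueeze(1) + delta, dim=-1)   # pair-specific text anchors
    v_n = F.normalize(v, dim=-1)
    sims = (t_adj * v_n.unsqueeze(0)).sum(-1) / tau       # (B, B) similarity logits
    labels = torch.arange(B, device=t.device)
    return F.cross_entropy(sims, labels) + beta * kl      # relaxed compression via small beta

# Usage: t, v are batch-aligned text/video embeddings from the retrieval backbone.
t, v = torch.randn(8, 512), torch.randn(8, 512)
loss = gare_style_loss(t, v, GapIncrement(512))
loss.backward()
```

Keeping `beta` small corresponds to the "relaxed compression" mentioned in the abstract: the KL term regularizes $\Delta$ toward an uninformative prior without squeezing out the pair-specific signal the increment is meant to carry.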
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 15119