Keywords: Text-video retrieval, Prototype-based learning, Clustering
Abstract: This work addresses the Intra-Inter Conflict (IIC) dilemma in text-video retrieval, _i.e._, (a) intra-category variance, where instances of the same category display substantial distributional disparity, and (b) inter-category similarity, where instances belonging to different categories exhibit distributional coupling. Through an analysis of the learned features and recalled samples of current models, we posit that this conflict stems from an appearance bias, _i.e._, the matching process is dominated by superficial semantics shared across samples, which undermines the contribution of discriminative semantics. To this end, we propose Prototype-based Regularization Learning (PRL), which regularizes the semantic boundaries of features through a set of prototypes, compelling the model to learn compact and distinctive representations for the text-video retrieval task. Specifically, PRL performs within- and cross-instance clustering in the embedding space to assign informative prototypes to instances of similar categories. Moreover, we propose a Prototype Discriminating Loss (PDL) that makes semantically correlated instances self-organize around their respective prototypes while maintaining separation across different ones, and a Prototype Projection Loss (PPL) that aligns video and text features by adaptively projecting prototypes into a shared semantic manifold, thereby fostering cross-modal correspondence.
Extensive experiments on five datasets demonstrate that the proposed model-agnostic strategy significantly boosts the performance of existing models, _e.g._, improving the SumR of TempMe, X-Pool, and CLIP4Clip by +6.5%, +3.1%, and +5.0%, respectively, on the MSR-VTT dataset.
Code available at: https://anonymous.4open.science/r/PRL-200D.
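The general idea of assigning instances to prototypes by clustering and then pulling each instance toward its own prototype while pushing it away from the others can be illustrated with a minimal NumPy sketch. This is not the authors' PDL or PPL, whose exact formulations are not given in the abstract; it assumes plain spherical k-means for the assignment and a softmax cross-entropy over instance-to-prototype similarities as a stand-in loss. The function names, the temperature value, and the number of iterations are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def assign_prototypes(features, num_prototypes, num_iters=10, seed=0):
    """Spherical k-means (an assumed stand-in for the paper's clustering).

    Returns (prototypes, assignments), where prototypes has shape
    (num_prototypes, dim) and assignments has shape (num_instances,).
    """
    rng = np.random.default_rng(seed)
    feats = l2_normalize(features)
    init = rng.choice(len(feats), size=num_prototypes, replace=False)
    protos = feats[init].copy()
    for _ in range(num_iters):
        # Assign each instance to its most similar prototype (cosine).
        assign = np.argmax(feats @ protos.T, axis=1)
        for k in range(num_prototypes):
            members = feats[assign == k]
            if len(members) > 0:  # keep the old prototype if a cluster empties
                protos[k] = l2_normalize(members.mean(axis=0))
    assign = np.argmax(feats @ protos.T, axis=1)
    return protos, assign


def prototype_contrastive_loss(features, protos, assign, temperature=0.1):
    """Cross-entropy over instance-to-prototype similarities.

    Low when each instance is close to its assigned prototype and far
    from all other prototypes. Hypothetical stand-in, not the paper's PDL.
    """
    feats = l2_normalize(features)
    logits = feats @ protos.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(feats)), assign].mean())
```

In a training loop, a term of this kind would typically be added to the base retrieval objective as a regularizer; how the paper weights and combines its losses is not stated in the abstract.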
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3109