Abstract: Existing cross-modal fusion methods for temporal language grounding suffer from noise in the unimodal representations, which contaminates the generated cross-modal embeddings and hinders the modeling of interactions between the language query and the target video segment. Furthermore, the cross-modal representations contain many irrelevant redundancies, which degrade the quality of the cross-modal features and thus interfere with accurate moment localization. To address these drawbacks, we propose a novel CrOss-modaL information-constraineD (COLD) model for temporal language grounding, which aims to learn a robust cross-modal embedding that is free of irrelevant redundancy while maximizing the interaction between the language query and the target video moment. Specifically, our model is built on the information bottleneck principle and features two information-constrained modules that operate from different perspectives: 1) the Cross-modal Highlight Information Bottleneck module is designed to maximize the mutual information between the language query and the target video moment; 2) the Fusion Information Bottleneck module is introduced to constrain the correlations between the cross-modal representations and the localization labels. Comprehensive experimental results on two public benchmark datasets demonstrate the superiority of the proposed model.
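As a point of reference for the information bottleneck principle the abstract invokes, a minimal sketch of the standard (textbook) IB objective is given below; the variables $X$, $Z$, $Y$ and the trade-off weight $\beta$ follow the generic formulation and are not notation taken from this paper.

\[
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y),
\]

where $X$ is the input (here, roughly, the fused cross-modal features), $Z$ is the compressed representation produced by the encoder $p(z \mid x)$, $Y$ is the prediction target (here, the moment-localization labels), and $I(\cdot\,;\cdot)$ denotes mutual information. Minimizing $I(X;Z)$ discourages $Z$ from retaining irrelevant redundancy from the input, while maximizing $I(Z;Y)$ keeps $Z$ predictive of the target; this trade-off underlies the two information-constrained modules described in the abstract.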