Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching
Abstract: Contrastive-learning-based models have achieved strong performance on image-text matching tasks. The key to these models lies in modeling the correlation between image-text pairs, which involves cross-modal interaction between embeddings in corresponding dimensions. However, the embeddings of the two modalities come from different models or modules, and a significant modality gap exists between them. Directly interacting such embeddings is poorly founded and may capture inaccurate correlations. We therefore propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in each dimension, so that the correlation calculation is based on interactions between similar information. (2) We introduce spatial constraints on inter- and intra-modality unmatched pairs to ensure effective semantic alignment. In addition, a sparse correlation algorithm is proposed to select strongly correlated spatial relationships, enabling the model to learn more significant features and avoid being misled by weak correlations. Extensive experiments demonstrate the superiority of DIAS, which achieves 4.3\%-10.2\% rSum improvements on the Flickr30k and MSCOCO benchmarks.
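As a rough illustration of aspect (1), the sketch below matches the per-dimension batch statistics of image and text embeddings; this is only one possible reading of "dimension information alignment", and the function and variable names (dimension_alignment_penalty, img_emb, txt_emb) are hypothetical rather than taken from DIAS.

```python
import torch

def dimension_alignment_penalty(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Encourage image and text embeddings to carry comparable information
    in each dimension by matching per-dimension batch statistics.
    img_emb, txt_emb: (batch, dim) outputs of the two encoders."""
    # Per-dimension mean and variance over the batch, for each modality: shape (dim,)
    img_mu, img_var = img_emb.mean(dim=0), img_emb.var(dim=0)
    txt_mu, txt_var = txt_emb.mean(dim=0), txt_emb.var(dim=0)
    # Penalize dimension-by-dimension mismatch of the statistics.
    return (img_mu - txt_mu).pow(2).mean() + (img_var - txt_var).pow(2).mean()
```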
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: The modality gap in the image-text matching task has been a long-standing research challenge. It leads to inaccurate correlation learning between image and text embeddings and can even undermine the soundness of the correlation calculation itself. We propose a novel model called DIAS, which bridges the modality gap by aligning the information representation of embeddings in each dimension and by strengthening constraints on unmatched image-text pairs. Our contributions are summarized as follows:
(1) We propose a dimension information alignment method for embeddings of different modalities, making cross-modal interaction better founded and suppressing feature redundancy.
(2) We introduce novel inter- and intra-modality constraints to ensure the effectiveness of semantic alignment.
(3) A sparse correlation algorithm is proposed to select strongly correlated spatial relationships, reducing the reliance on embedding symmetry; an illustrative sketch of (2) and (3) is given below.
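The following is a non-authoritative sketch of contributions (2) and (3): it scores an image-text pair using only its top-k strongest dimension-wise correlations and applies hinge constraints on inter-modality unmatched (hardest-negative) pairs; intra-modality constraints could be formed analogously on image-image and text-text similarities. All names (sparse_topk_similarity, unmatched_pair_constraints, k, margin) are hypothetical, and the actual DIAS algorithm and constraints may differ.

```python
import torch
import torch.nn.functional as F

def sparse_topk_similarity(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Score every image-text pair using only its k strongest
    dimension-wise correlations (k must not exceed the embedding dim)."""
    img_n = F.normalize(img_emb, dim=-1)            # (B, D)
    txt_n = F.normalize(txt_emb, dim=-1)            # (B, D)
    # Dimension-wise products for all pairs: (B, B, D); fine for sketch-sized batches.
    prod = img_n.unsqueeze(1) * txt_n.unsqueeze(0)
    # Keep the k largest per-dimension contributions and sum them into a score.
    return prod.topk(k, dim=-1).values.sum(dim=-1)  # (B, B) similarity matrix

def unmatched_pair_constraints(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Hinge constraints on inter-modality unmatched pairs, assuming the
    matched (positive) pairs lie on the diagonal of the similarity matrix."""
    pos = sim.diag()                                 # (B,) matched-pair scores
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float("-inf"))       # exclude matched pairs
    hardest_i2t = neg.max(dim=1).values              # hardest caption per image
    hardest_t2i = neg.max(dim=0).values              # hardest image per caption
    return (F.relu(margin + hardest_i2t - pos).mean()
            + F.relu(margin + hardest_t2i - pos).mean())
```

In a training loop, these losses would typically be summed with the dimension-alignment penalty sketched above; the choice of k controls how aggressively weakly correlated dimensions are discarded.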
Submission Number: 3911