Fine-Grained Matching with Multi-Perspective Similarity Modeling for Cross-Modal Retrieval

Published: 20 May 2022, Last Modified: 05 May 2023
UAI 2022 Poster
Keywords: Image-text retrieval, Semantic knowledge, Similarity representation learning, Similarity-relation learning, Graph neural network
Abstract: Cross-modal retrieval relies on learning inter-modal correspondences. Most existing approaches focus on learning global or local correspondence and fail to explore fine-grained multi-level alignments. Moreover, how to infer more accurate similarity scores remains an open question. In this paper, we propose a novel fine-grained matching network with Multi-Perspective Similarity Modeling (MPSM) for cross-modal retrieval. Specifically, the Knowledge Graph Iterative Dissemination (KGID) module is designed to iteratively broadcast global semantic knowledge, so that domain information is integrated and relevant nodes are associated, yielding fine-grained modality representations. Subsequently, vector-based similarity representations are learned from multiple perspectives to comprehensively model multi-level alignments. The Similarity Relation Graph Reconstruction (SRGR) module is further developed to enhance cross-modal correspondence by constructing similarity relation graphs and adaptively reconstructing them. Extensive experiments on the Flickr30K and MSCOCO datasets validate that our model significantly outperforms several state-of-the-art baselines.
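The abstract does not give implementation details, so the following is only a minimal, hypothetical PyTorch sketch of the two ideas it names: an iterative graph-dissemination step in the spirit of KGID, and a vector-valued (rather than scalar) similarity representation that a downstream module such as SRGR could reason over. All class and function names here are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeGraphDissemination(nn.Module):
    """Hypothetical KGID-style block: node features (e.g. image regions or
    words) exchange messages along knowledge-graph edges for several steps,
    so global semantic knowledge is broadcast and related nodes become
    associated."""

    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.msg = nn.Linear(dim, dim)   # message transform on neighbor states
        self.upd = nn.GRUCell(dim, dim)  # gated update of each node state

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (n, dim) node features; adj: (n, n) row-normalized adjacency
        h = nodes
        for _ in range(self.steps):
            m = adj @ self.msg(h)        # aggregate messages from neighbors
            h = self.upd(m, h)           # one dissemination iteration
        return h


def similarity_vector(img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
    """Vector-based similarity representation: instead of collapsing an
    image-text pair to one cosine score, keep the element-wise product of
    the normalized embeddings as a similarity vector."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    return img * txt                     # (dim,) similarity representation


if __name__ == "__main__":
    n, dim = 5, 64
    nodes = torch.randn(n, dim)
    adj = torch.softmax(torch.randn(n, n), dim=-1)  # toy normalized graph
    kgid = KnowledgeGraphDissemination(dim)
    refined = kgid(nodes, adj)
    sim = similarity_vector(refined.mean(0), torch.randn(dim))
    print(refined.shape, sim.shape)      # torch.Size([5, 64]) torch.Size([64])
```

A scalar score can still be recovered by pooling the similarity vector (e.g. `sim.sum()`), which is why retaining the full vector loses nothing while giving a relation-graph module richer input.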