Structure-Aware Adaptive Hybrid Interaction Modeling for Image-Text Matching

Published: 01 Jan 2024 · Last Modified: 20 May 2025 · MMM (1) 2024 · CC BY-SA 4.0
Abstract: Image-text matching is a rapidly evolving area of multimodal learning that aims to measure the similarity between images and texts. Despite significant recent progress, most existing methods rely on a static image-text interaction scheme, overlooking the substantial variation in scene complexity across samples. In practice, the multimodal interaction strategy should be adjusted flexibly according to the scene complexity of each input; for instance, excessive multimodal interaction may introduce noise on simple samples. In this paper, we propose a novel Structure-aware Adaptive Hybrid Interaction Modeling (SAHIM) network, which adaptively adjusts the image-text interaction strategy for each input. Moreover, we design a Multimodal Graph Inference (MGI) module to explore latent structural connections between global and local features, and an Entity Attention Enhancement (EAE) module to filter out irrelevant local segments. Finally, we align the image and text features with a bidirectional triplet loss. To validate SAHIM, we conduct comprehensive experiments on Flickr30K and MSCOCO. The results show that SAHIM outperforms state-of-the-art methods on both datasets, demonstrating the superiority of our model.
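The abstract names the bidirectional triplet loss as the alignment objective but gives no formulation. Below is a minimal PyTorch sketch of the standard hinge-based bidirectional triplet ranking loss commonly used in image-text matching (as in VSE++); the class name `BidirectionalTripletLoss` and the `margin` and `hard_negative` parameters are illustrative assumptions, since the paper's exact formulation (e.g., hardest-negative mining versus summing over all negatives) is not specified here.

```python
import torch
import torch.nn as nn


class BidirectionalTripletLoss(nn.Module):
    """Hinge-based bidirectional triplet ranking loss (sketch).

    Assumes a square similarity matrix `sim` of shape (B, B) whose
    diagonal holds the matched image-text pairs in a batch of size B.
    """

    def __init__(self, margin: float = 0.2, hard_negative: bool = True):
        super().__init__()
        self.margin = margin
        self.hard_negative = hard_negative

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        # Positive similarities: sim[i, i] for each matched pair.
        pos = sim.diag().view(-1, 1)
        # Image-to-text: penalize texts that outrank the matched caption.
        cost_i2t = (self.margin + sim - pos).clamp(min=0)
        # Text-to-image: penalize images that outrank the matched image.
        cost_t2i = (self.margin + sim - pos.t()).clamp(min=0)
        # Exclude the diagonal so positives are not treated as negatives.
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        cost_i2t = cost_i2t.masked_fill(mask, 0)
        cost_t2i = cost_t2i.masked_fill(mask, 0)
        if self.hard_negative:
            # Keep only the hardest negative in each direction (VSE++ style).
            return cost_i2t.max(dim=1)[0].sum() + cost_t2i.max(dim=0)[0].sum()
        return cost_i2t.sum() + cost_t2i.sum()
```

In use, `sim` would typically be the cosine-similarity matrix between a batch of B image embeddings and their B paired text embeddings, so that row i scores image i against every caption in the batch.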