Abstract: Image–text matching aims to retrieve matching images or texts given textual or visual queries. However, it is inherently a many-to-many problem: an image can correspond to multiple levels of visual semantic scenes, each of which can be described by different texts, and a textual description can likewise be visualized through multiple visual scenes. This leads to ambiguity in the matching between images and texts. To better capture these matching relationships, we employ graph convolutional networks to extract multi-level semantic information from image–text pairs, and construct Gaussian distribution representations for images and texts instead of conventional point representations. Furthermore, we introduce an inter-modal mixture of Gaussian distributions to constrain the matching relationships between image–text pairs, which ensures more precise distribution representations in a shared space and strengthens the cross-modal correlation. Experiments on Flickr30K and MS-COCO, two widely used datasets, demonstrate the superior performance of our approach.
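To make the contrast between point and distribution representations concrete, the following is a minimal sketch, not the paper's implementation. It assumes each image or text is embedded as a diagonal Gaussian (a mean vector plus per-dimension variances) and that candidates are ranked by the squared 2-Wasserstein distance. The abstract does not specify the parameterization or the matching score, so `DiagonalGaussian`, `wasserstein2_squared`, and `rank_candidates` are hypothetical names chosen for illustration, and the inter-modal mixture constraint is not modeled here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiagonalGaussian:
    """Hypothetical distribution embedding: a mean and per-dimension variances."""
    mean: Sequence[float]
    var: Sequence[float]  # variances, assumed non-negative


def wasserstein2_squared(p: DiagonalGaussian, q: DiagonalGaussian) -> float:
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances this reduces to the squared distance between
    the means plus the squared distance between the standard deviations.
    Used here as an assumed stand-in for the paper's matching score.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(p.mean, q.mean))
    spread_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p.var, q.var)
    )
    return mean_term + spread_term


def rank_candidates(
    query: DiagonalGaussian, candidates: List[DiagonalGaussian]
) -> List[int]:
    """Return candidate indices ordered from closest to farthest."""
    return sorted(
        range(len(candidates)),
        key=lambda i: wasserstein2_squared(query, candidates[i]),
    )


# Toy example: one image query and three candidate text embeddings.
image = DiagonalGaussian(mean=[0.0, 0.0], var=[1.0, 1.0])
texts = [
    DiagonalGaussian(mean=[3.0, 4.0], var=[1.0, 1.0]),
    DiagonalGaussian(mean=[0.0, 0.0], var=[1.0, 1.0]),
    DiagonalGaussian(mean=[1.0, 0.0], var=[4.0, 1.0]),
]
ranking = rank_candidates(image, texts)
```

Unlike a point embedding, the variance term lets two inputs with similar means still differ in how broad a range of scenes or descriptions they cover, which is one way to express the many-to-many ambiguity described above.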
External IDs: dblp:journals/ipm/LiuYLNLC25