Abstract: As a fundamental multimodal task, image-text retrieval bridges the gap between vision and language. Current mainstream methods exploit attention mechanisms to discover potential alignments between visual regions and textual words, but ignore the imbalance of image-text information. To this end, we propose in this paper a Cross-modal Information Balance-aware Reasoning Network (CIBRN), which adopts information balance and similarity reasoning mechanisms to distinguish matched from unmatched image-text pairs. Specifically, a two-stage scheme is employed to balance image-text information. In the first stage, a Graph Convolutional Network (GCN) with multiple convolution kernels converts elements that exist in only one modality into common elements, indirectly achieving intra-modal information balance. In the second stage, we propose an information “Add-Reduce” mechanism that realizes inter-modal information balance by adding a random feature drawn from a Gaussian distribution to each textual “word” and reducing fixed-length information from each visual “region”. Subsequently, a block-based hierarchical matching method and mean-based fully connected layers are proposed to reason about the relevance of images and texts. Extensive experiments on two benchmark datasets, i.e., Flickr30K and MSCOCO, demonstrate the effectiveness of the proposed CIBRN, which outperforms the state-of-the-art method with gains of 7.0% and 3.0% in rSum, respectively.
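
To make the “Add-Reduce” idea more concrete, the following minimal PyTorch sketch illustrates one possible reading of the mechanism described above: a Gaussian random feature is added to every textual word feature, and a fixed-length slice is removed from every visual region feature. The function name, the noise scale `sigma`, and the reduced length `reduce_len` are illustrative assumptions and are not taken from the paper.

```python
import torch

def add_reduce(word_feats: torch.Tensor,
               region_feats: torch.Tensor,
               sigma: float = 0.1,
               reduce_len: int = 64):
    """Hypothetical sketch of an 'Add-Reduce' inter-modal balance step.

    word_feats   : (n_words,   d) textual word features
    region_feats : (n_regions, d) visual region features
    sigma, reduce_len are illustrative hyper-parameters, not from the paper.
    """
    # "Add": augment every textual word with a Gaussian random feature,
    # increasing the amount of information carried by the textual side.
    gaussian_feat = sigma * torch.randn_like(word_feats)
    words_balanced = word_feats + gaussian_feat

    # "Reduce": drop a fixed-length slice from every visual region feature,
    # decreasing the amount of information carried by the visual side.
    regions_balanced = region_feats[:, :-reduce_len]

    return words_balanced, regions_balanced
```

In this reading, the two operations push the information content of the two modalities toward each other before similarity reasoning; the exact form of the added feature and the removed slice would follow the paper’s full method description.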