The DenseCap-Guided Attention Network For Image-Text Matching

Published: 2025, Last Modified: 14 Jan 2026 | WWW (Companion Volume) 2025 | CC BY-SA 4.0
Abstract: Image-text matching is a typical cross-modal task that has recently attracted great interest in multimedia and computer vision. Previous image-text matching methods mostly rely on coarse appearance features to guide the learning of monotonous image and text representations, with an attention-aware mechanism introduced for matching. Such coarse-feature-guided representations lack the fine-grained and diverse semantic information that is widely regarded as an important cue for aligning image-text pairs, which leads to mismatches between images and texts at a fine-grained level. In this paper, we propose a novel DenseCap-guided Attention Network, termed DAN, which uses fine-grained and diverse dense caption representations as a bridge to link images and texts. In particular, dense captions are first extracted from the given image by a densecap-parser. Then, a densecap-guided attention module is designed to mine fine-grained and discriminative correspondences between image and sentence. Finally, a graph-structured matching network is used to learn the associations and alignments between visual and textual attention-aware features. Quantitative results show that the proposed DAN outperforms state-of-the-art and alternative approaches under various standard evaluation metrics on two public benchmarks, Microsoft COCO and Flickr30K.
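The abstract does not specify how the densecap-guided attention module is computed. As a rough illustration of the general idea only, the sketch below lets each dense-caption embedding attend over image-region features using cosine similarity and a softmax. The function names, the similarity measure, and the `temperature` parameter are all assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def caption_guided_attention(regions, captions, temperature=0.1):
    """Hypothetical caption-guided attention (an assumption, not the paper's module).

    regions:  list of region feature vectors, all of the same dimension
    captions: list of dense-caption embeddings of that same dimension
    Returns one attended region feature per caption.
    """
    dim = len(regions[0])
    attended = []
    for cap in captions:
        # Weight each region by its similarity to the current caption.
        weights = softmax([cosine(cap, r) for r in regions], temperature)
        # Weighted sum of region features under those attention weights.
        attended.append(
            [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
        )
    return attended


# Toy example: two regions and one caption aligned with the first region.
toy_regions = [[1.0, 0.0], [0.0, 1.0]]
toy_captions = [[1.0, 0.0]]
toy_output = caption_guided_attention(toy_regions, toy_captions)
```

In this toy case the attended feature is dominated by the first region, because the caption embedding points in the same direction. A real system would use learned projections and high-dimensional features from a detector and a text encoder.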