Abstract: Image-text matching, a core task in multimodal learning that aligns visual and textual semantics, faces two critical challenges: (1) existing graph-based methods often struggle to balance over-connection and semantic loss due to rigid thresholding strategies, and (2) single-level interaction mechanisms fail to capture hierarchical cross-modal dependencies effectively. To address these problems, we propose a framework integrating Dynamic Semantic Graph Enhancement (DSGE) with Progressive Semantic Alignment (PSA). The DSGE module adaptively adjusts graph connectivity based on the statistical properties of similarity distributions, overcoming the limitations of manually defined thresholds that typically result in either over-connection or the omission of critical relationships. The PSA module establishes coarse-grained correspondences through bidirectional cross-modal attention and progressively refines alignment precision using a context-aware hierarchical strategy. Comprehensive evaluations on Flickr30K and MS-COCO, particularly in complex semantic scenarios, confirm that our framework achieves significant performance gains over existing methods.
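The abstract's DSGE idea, adapting graph connectivity from the statistics of the similarity distribution rather than a fixed manual threshold, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the mean-plus-scaled-standard-deviation rule, and the `alpha` hyperparameter are all assumptions made for illustration.

```python
import numpy as np

def adaptive_graph_edges(sim: np.ndarray, alpha: float = 0.5):
    """Illustrative sketch of distribution-driven graph construction.

    Instead of a manually chosen cutoff, the edge threshold is derived
    from the off-diagonal similarity statistics: mean + alpha * std.
    `alpha` is a hypothetical scaling hyperparameter, not from the paper.
    """
    n = sim.shape[0]
    # Collect off-diagonal similarities (self-similarity is excluded).
    off_diag = sim[~np.eye(n, dtype=bool)]
    threshold = off_diag.mean() + alpha * off_diag.std()
    # Keep only edges whose similarity clears the adaptive threshold.
    adjacency = (sim >= threshold).astype(float)
    np.fill_diagonal(adjacency, 0.0)
    return adjacency, threshold

# Example: three nodes where only the first pair is strongly similar.
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
adj, t = adaptive_graph_edges(sim)
```

Because the threshold tracks the similarity distribution itself, a graph with uniformly high similarities is pruned more aggressively than one with sparse strong matches, which is the behavior a fixed threshold cannot provide.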
External IDs: dblp:conf/icic/WangLZZ25