Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

Published: 01 Jan 2023, Last Modified: 20 Mar 2024 · ACM Multimedia 2023
Abstract: Image-text matching, as a fundamental cross-modal task, bridges vision and language. The key challenge lies in accurately learning the semantic similarity of these two heterogeneous modalities. To determine the semantic similarity between visual and textual features, the existing paradigm typically first maps them into a d-dimensional shared representation space, and then independently aggregates the correspondences of all dimensions of the cross-modal features, e.g., via the inner product. However, in this paper, we are motivated by an insightful finding that dimensions are not mutually independent; rather, there are intrinsic dependencies among dimensions that jointly represent latent semantics. Ignoring this intrinsic information likely leads to suboptimal aggregation for semantic similarity, impairing cross-modal matching learning. To solve this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and utilized. X-Dim (1) designs a generalized framework to learn dimensions' semantic dependency degrees, and (2) devises adaptive sparse probabilistic learning to autonomously make the model capture precise dependencies. Theoretical analysis and extensive experiments demonstrate the superiority of X-Dim over state-of-the-art methods, achieving 5.9%-7.3% rSum improvements on the Flickr30K and MS-COCO benchmarks.
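The contrast the abstract draws can be sketched in a few lines of NumPy: the standard paradigm scores a pair by summing per-dimension correspondences independently (the inner product), whereas a dependency-aware aggregation reweights those correspondences with a learned d×d dependency matrix before summing. The matrix `W` below is a placeholder, not the paper's actual learned dependency structure; this is a minimal illustration of the aggregation difference, assuming L2-normalized features in the shared space.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# L2-normalized image and text features in a shared d-dimensional space
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
t = rng.standard_normal(d)
t /= np.linalg.norm(t)

# Standard paradigm: aggregate each dimension independently (inner product)
s_inner = float(v @ t)

# Dependency-aware sketch (hypothetical): per-dimension correspondences
# are reweighted by a d x d dependency matrix W before summation.
c = v * t          # per-dimension correspondence
W = np.eye(d)      # placeholder for a *learned* dependency matrix
s_dep = float(np.sum(W @ c))

# With W = I (no cross-dimensional dependencies), the dependency-aware
# score reduces exactly to the inner product.
assert np.isclose(s_inner, s_dep)
```

Off-diagonal entries of `W` are what would let correlated dimensions reinforce one another; learning which of those entries to activate, sparsely and adaptively, is the problem X-Dim addresses.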