Abstract: Image-text matching measures the semantic similarity between images and text for cross-modal retrieval. The core of this task is semantic embedding, which mines the intrinsic characteristics of the visual and textual modalities to build discriminative representations. However, image-text pairs are inherently ambiguous: one-to-many associations exist across modalities, which gives rise to semantic diversity. Mainstream approaches represent semantics with fixed point embeddings and ignore the embedding uncertainty caused by semantic diversity, which can lead to incorrect matching results. To address this issue, we propose Semantic Embedding Uncertainty Learning (SEUL), which represents the embedding uncertainty of images and text as Gaussian distributions and simultaneously learns the salient embedding (mean) and the uncertainty (variance) in the common space. The semantic uncertainty embedding is designed to make the representation more robust under semantic diversity. We also propose a combined objective function that optimizes the semantic uncertainty while maintaining discriminability, thereby enhancing cross-modal associations. Extensive experiments on two datasets demonstrate the advanced performance of the method.
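The idea of replacing a point embedding with a Gaussian one can be illustrated with a minimal sketch. This is not the SEUL implementation: the function names, the diagonal-covariance parameterization through a log-variance vector, the KL term against a standard normal, and the Monte Carlo cosine similarity are all assumptions chosen for illustration, and the combined objective mentioned in the abstract is not reproduced here.

```python
import math
import random

# Each embedding is a pair (mean, log_var) describing a diagonal Gaussian.
# All names below are hypothetical and used for illustration only.

def sample_embedding(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), one common way to
    regularize the learned variance (an assumption, not the paper's loss)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expected_similarity(img, txt, n_samples=64, seed=0):
    """Monte Carlo estimate of the mean cosine similarity between samples
    drawn from the image and text distributions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z_img = sample_embedding(img[0], img[1], rng)
        z_txt = sample_embedding(txt[0], txt[1], rng)
        total += cosine(z_img, z_txt)
    return total / n_samples
```

In this sketch the mean plays the role of the salient embedding and the variance expresses how much an ambiguous image or caption may spread in the common space; with a near-zero variance the sampled similarity reduces to the ordinary point-embedding cosine similarity.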
External IDs: dblp:conf/icmcs/WangSLYZLL23