Keywords: Cross-modal Hashing, Unsupervised Hash Retrieval, Cross-modal Retrieval
Abstract: Cross-modal retrieval is a significant task that aims to learn the semantic correspondence between visual and textual modalities. Unsupervised hashing methods can efficiently handle large-scale data and are therefore well suited to cross-modal retrieval. However, existing methods typically fail to fully exploit the hierarchical structure between text and image data. Moreover, the commonly used direct modal alignment cannot effectively bridge the semantic gap between these two modalities. To address these issues, we introduce a novel Hierarchical Encoding Tree with Modality Mixup (HINT) method, which achieves effective cross-modal retrieval by extracting hierarchical cross-modal relations. HINT constructs a cross-modal encoding tree guided by hierarchical structural entropy and generates proxy samples of the text and image modalities for each instance from the encoding tree. Through a curriculum-based mixup of these proxy samples, HINT achieves progressive modal alignment and effective cross-modal retrieval. Furthermore, we conduct cross-modal consistency learning to achieve global-view semantic alignment between text and image representations. Extensive experiments on a range of cross-modal retrieval datasets demonstrate the superiority of HINT over state-of-the-art methods.
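To make the curriculum-based mixup of proxy samples concrete, the following is a minimal sketch (not the authors' implementation): each instance's feature is linearly mixed with its cross-modal proxy, with a mixing coefficient that grows over training so alignment proceeds progressively. All function names, the schedule, and the tensor shapes are illustrative assumptions.

```python
# Illustrative sketch of curriculum-based proxy mixup (assumed, not the paper's code).
import torch

def curriculum_lambda(epoch: int, total_epochs: int, lam_max: float = 0.5) -> float:
    """Mixing coefficient increases linearly with training progress, so early
    training stays close to the original modality and later training mixes in
    more of the cross-modal proxy (assumed linear schedule)."""
    return lam_max * min(1.0, epoch / max(1, total_epochs - 1))

def mixup_with_proxy(features: torch.Tensor, proxies: torch.Tensor, lam: float) -> torch.Tensor:
    """Convex combination of each instance's feature with its cross-modal proxy."""
    return (1.0 - lam) * features + lam * proxies

# Usage: image embeddings mixed with text-side proxies (here random placeholders).
img_feat = torch.randn(8, 128)   # batch of image embeddings (illustrative)
txt_proxy = torch.randn(8, 128)  # matching text proxies from the encoding tree (illustrative)
lam = curriculum_lambda(epoch=10, total_epochs=50)
mixed = mixup_with_proxy(img_feat, txt_proxy, lam)
```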
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 16096