Hierarchical Encoding Tree with Modality Mixup for Cross-modal Hashing

Published: 26 Jan 2026, Last Modified: 08 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Cross-modal Hashing, Unsupervised Hash Retrieval, Cross-modal Retrieval
Abstract: Cross-modal retrieval is a fundamental task that aims to learn semantic correspondences across different data modalities, such as the visual and textual modalities. Unsupervised hashing methods can efficiently manage large-scale data and are therefore well suited to cross-modal retrieval. However, existing methods typically fail to fully exploit the hierarchical semantic structure within text and image data, where instances naturally organize into multi-level communities of varying granularity. Moreover, the commonly used strategy of direct modal alignment cannot effectively bridge the semantic gap between the two modalities. To address these issues, we introduce a novel Hierarchical Encoding Tree with Modality Mixup (HINT) method, which achieves effective cross-modal retrieval by extracting hierarchical cross-modal relations. HINT constructs a cross-modal encoding tree guided by hierarchical structural entropy and, from this tree, generates proxy samples in the text and image modalities for each instance. Through curriculum-based mixup of these proxy samples, HINT achieves progressive modal alignment and effective cross-modal retrieval. We further conduct cross-modal consistency learning to achieve global-view semantic alignment between text and image representations. Extensive experiments on a range of cross-modal retrieval datasets demonstrate the superiority of HINT over state-of-the-art methods.
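The abstract describes a curriculum-based mixup of proxy samples but gives no implementation details. Below is a minimal sketch of one plausible reading: an instance's embedding is linearly interpolated with its cross-modal proxy, with the mixing ratio annealed over training so that alignment is introduced progressively. The function name, the linear schedule, and the `lam_max` cap are all assumptions for illustration, not the authors' actual formulation.

```python
import numpy as np

def curriculum_mixup(z_img, z_txt_proxy, step, total_steps, lam_max=0.5):
    """Mix an image embedding with its text-modality proxy sample.

    The mixing coefficient grows linearly with training progress
    (a simple curriculum schedule, assumed here): early steps keep
    the embedding close to its own modality, later steps blend in
    more of the proxy from the other modality.
    """
    lam = lam_max * min(step / total_steps, 1.0)  # anneal 0 -> lam_max
    return (1.0 - lam) * z_img + lam * z_txt_proxy

# Toy usage with 4-dim embeddings (hypothetical values):
z_img = np.array([1.0, 0.0, 0.0, 0.0])
z_txt = np.array([0.0, 1.0, 0.0, 0.0])
early = curriculum_mixup(z_img, z_txt, step=0, total_steps=100)    # == z_img
late = curriculum_mixup(z_img, z_txt, step=100, total_steps=100)   # equal blend at lam_max=0.5
```

Other schedules (exponential, stage-wise) would fit the same "progressive modal alignment" idea; the key property is that the cross-modal mixing ratio starts near zero and increases with training.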
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 16096