Semantic-aware matrix factorization hashing with intra- and inter-modality fusion for image-text retrieval

Dongxue Shi, Zheng Liu, Shanshan Gao, Ang Li

Published: 2025, Last Modified: 11 Apr 2025. Appl. Intell. 2025. CC BY-SA 4.0

Abstract: Cross-modal retrieval aims to retrieve related items in one modality using a query from another modality. Image-text retrieval, a foundational and central task in this area, has attracted significant research interest. In recent years, hashing techniques have been widely adopted for retrieval over large-scale datasets because of their low storage requirements and fast query processing. However, existing hashing approaches learn either unified representations shared by both modalities or specific representations within each modality. The former lack modality-specific information, while the latter ignore the relationships between paired images and texts across modalities. We therefore propose a supervised hashing method that leverages both intra-modality and inter-modality matrix factorization. The method integrates semantic labels into hash code learning and captures inter-modality and intra-modality relationships within a unified framework, with the objective of preserving inter-modal complementarity and intra-modal consistency in multimodal data. Our approach involves: (1) mapping data from the different modalities into a shared latent semantic space through inter-modality matrix factorization to derive unified hash codes, and (2) mapping data from each modality into a modality-specific latent semantic space via intra-modality matrix factorization to obtain modality-specific hash codes. The unified and modality-specific codes are then merged to construct the final hash codes. Experimental results demonstrate that our approach surpasses several state-of-the-art hashing methods for cross-modal image-text retrieval, and ablation studies validate the effectiveness of each component of the model.
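The two factorization steps described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' objective or optimizer, which the abstract does not specify. It assumes plain ridge-regularised alternating least squares, an equal weighting of the two modalities, sign thresholding for binarization, and concatenation as the merge step, and it leaves out the semantic label term entirely. All function names, parameters, and dimensions are hypothetical.

```python
import numpy as np


def factorize(X, k, n_iter=50, reg=1e-2, seed=0):
    """Approximate X (d x n) as U (d x k) @ V (k x n) by alternating ridge least squares."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((k, X.shape[1]))
    ridge = reg * np.eye(k)
    for _ in range(n_iter):
        # Fix V and solve for U, then fix U and solve for V.
        U = X @ V.T @ np.linalg.inv(V @ V.T + ridge)
        V = np.linalg.inv(U.T @ U + ridge) @ U.T @ X
    return U, V


def binarize(V):
    """Centre each latent dimension and take its sign to get codes in {-1, +1}."""
    centred = V - V.mean(axis=1, keepdims=True)
    return np.where(centred >= 0, 1, -1).astype(np.int8)


def sketch_hash_codes(X_img, X_txt, k_shared, k_specific):
    """Illustrative only: shared codes plus modality-specific codes, merged by concatenation."""
    # Inter-modality step (assumed form): both modalities share one latent matrix V.
    # With equal weights this is the same as factorizing the row-stacked features.
    _, V_shared = factorize(np.vstack([X_img, X_txt]), k_shared)

    # Intra-modality step (assumed form): each modality gets its own latent space.
    _, V_img = factorize(X_img, k_specific, seed=1)
    _, V_txt = factorize(X_txt, k_specific, seed=2)

    # Merge step: concatenation is an assumption, the abstract only says "merged".
    return np.vstack([binarize(V_shared), binarize(V_img), binarize(V_txt)])


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X_img = rng.standard_normal((64, 100))  # hypothetical 64-d image features, 100 pairs
    X_txt = rng.standard_normal((32, 100))  # hypothetical 32-d text features, same pairs
    B = sketch_hash_codes(X_img, X_txt, k_shared=16, k_specific=8)
    print(B.shape)
```

Each column of the returned matrix is the code of one image-text pair. The shared block comes from a latent space fitted to both modalities at once, while the remaining blocks come from spaces fitted to one modality each, which mirrors the split between unified and modality-specific codes described above.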