Abstract: Cross-modal hashing retrieval has emerged as a promising approach for handling diverse multimodal data, owing to its advantages in storage efficiency and query speed. However, existing cross-modal hashing retrieval methods often oversimplify similarity by considering only identical labels across modalities, and they are sensitive to noise in the original multimodal data. To address these challenges, we propose a cross-modal hashing retrieval approach with compatible triplet representation. The proposed approach integrates the essential feature representations and semantic information of text and images into their corresponding multi-label feature representations, and introduces a fusion attention module that extracts channel and spatial attention features from the text and image modalities, respectively, thereby enriching the compatible-triplet-based semantic information used in cross-modal hashing learning. Comprehensive experiments on three public datasets demonstrate that the proposed approach achieves higher retrieval accuracy than state-of-the-art methods.
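For a concrete picture of the kind of fusion attention module described above, the following is a minimal sketch assuming a PyTorch-style implementation in which channel attention is applied to text features and spatial attention to image feature maps before projection into a shared hash-code space. All class names, dimensions, and the tanh relaxation of the hashing layer are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal, hypothetical sketch of a fusion attention module for cross-modal
# hashing: channel attention over text features, spatial attention over image
# feature maps, then projection to relaxed binary codes. Names and sizes are
# illustrative assumptions only.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over a feature vector."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, D)
        return x * self.mlp(x)

class SpatialAttention(nn.Module):
    """Single-channel spatial attention map over a convolutional feature map."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.max(dim=1, keepdim=True).values
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn

class FusionAttention(nn.Module):
    """Channel attention for text, spatial attention for images, then a
    shared-length hash projection with a tanh relaxation of the sign function."""
    def __init__(self, text_dim: int, img_channels: int, hash_bits: int = 64):
        super().__init__()
        self.text_att = ChannelAttention(text_dim)
        self.img_att = SpatialAttention()
        self.text_hash = nn.Linear(text_dim, hash_bits)
        self.img_hash = nn.Linear(img_channels, hash_bits)

    def forward(self, text_feat: torch.Tensor, img_feat: torch.Tensor):
        t = self.text_att(text_feat)                 # (B, text_dim)
        v = self.img_att(img_feat).mean(dim=(2, 3))  # pooled to (B, img_channels)
        return torch.tanh(self.text_hash(t)), torch.tanh(self.img_hash(v))

# Example: 8 samples, 512-d text features, 256-channel 14x14 image feature maps.
text_codes, img_codes = FusionAttention(512, 256)(
    torch.randn(8, 512), torch.randn(8, 256, 14, 14))
```

In such a design, a triplet loss over the relaxed codes would then pull semantically compatible text-image pairs together and push mismatched pairs apart; that training objective is not shown here.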