Abstract: Cross-modal hashing has attracted considerable attention in the multimedia community. Two-stage methods often show impressive performance by first learning hash codes for data instances from different modalities, and then learning hash functions that map the original multimodal data to the low-dimensional hash codes. However, most existing two-stage methods struggle to obtain satisfactory hash codes in the first stage, as the commonly used coarse-grained similarity matrix fails to capture the differentiated similarity relationships between the original data instances. Moreover, such methods often fail to learn satisfactory hash functions in the second stage, where hash function learning is treated as a multi-binary classification problem. In this paper, we propose a novel two-stage hashing method for cross-modal retrieval. In the first stage, we capture the differentiated similarity relationships between data instances by designing a fine-grained similarity matrix, and we add an autoencoder to mine semantic information. In the second stage, we introduce a similarity sensitivity learning strategy that trains the hash functions under the guidance of the similarity matrix. This strategy makes the training process sensitive to similar and hard pairs, boosting retrieval performance. Comprehensive experiments on three benchmark datasets validate the effectiveness of our method.
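To illustrate the idea of a fine-grained similarity matrix, the sketch below grades pairwise similarity by multi-label overlap (cosine similarity of label vectors) instead of the usual hard 0/1 indicator. This is only an assumed, minimal instantiation for illustration; the function name and the cosine-based formulation are ours, not necessarily the exact construction used in the paper.

```python
import numpy as np

def fine_grained_similarity(labels):
    """Build a soft similarity matrix from multi-label annotations.

    labels: (n, c) binary matrix, labels[i, k] = 1 if instance i has class k.
    Returns an (n, n) matrix in [0, 1]: 0 for pairs sharing no label and a
    graded value for pairs with partial label overlap, rather than a
    coarse-grained 0/1 similarity.
    """
    labels = labels.astype(float)
    norms = np.linalg.norm(labels, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard against unlabeled instances
    normalized = labels / norms
    return normalized @ normalized.T  # cosine similarity of label vectors

# Example: three instances with overlapping label sets
L = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 0]])
print(fine_grained_similarity(L))
```

Such a graded matrix can then serve both as the supervision target for hash code learning and as the guidance signal that weights similar and hard pairs more heavily when training the hash functions.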