Abstract: In recent years, driven by the rapid growth of cross-modal data such as images and texts, cross-modal retrieval has received intensive attention. Great progress has been made in deep cross-modal hash retrieval, which integrates feature learning and hash learning into an end-to-end trainable framework to obtain better hash codes. However, due to the heterogeneity between images and texts, comparing their similarity remains a challenge. Most previous approaches embed images and texts into a joint embedding subspace independently and then compare their similarity, which ignores both the influence of irrelevant regions (regions in images without a corresponding textual description) on cross-modal retrieval and the fine-grained interactions between images and texts. To address these issues, a new cross-modal hashing method called Deep Translated Attention Hashing for Cross-Modal Retrieval (DTAH) is proposed. Firstly, DTAH extracts image and text features through bottom-up attention and a recurrent neural network, respectively, to reduce the influence of irrelevant regions on cross-modal retrieval. Then, with the help of a cross-modal attention module, DTAH captures the fine-grained interactions between vision and language at the region and word levels, and embeds the text features into the image feature space. In this way, the proposed DTAH effectively narrows the heterogeneity gap between images and texts and learns discriminative hash codes. Extensive experiments on three benchmark data sets demonstrate that DTAH surpasses state-of-the-art methods.
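The abstract describes a pipeline of region-level and word-level feature extraction, a cross-modal attention module that maps text into the image feature space, and a hashing head. The sketch below illustrates one plausible reading of that pipeline in PyTorch; all module names, feature dimensions, the scaled dot-product attention formulation, and the mean-pooling step are assumptions for illustration, not the paper's actual architecture or loss.

```python
# Minimal sketch of a translated-attention hashing head (illustrative only).
import torch
import torch.nn as nn


class CrossModalAttentionHashing(nn.Module):
    """Hypothetical head combining word-to-region attention with hashing.

    Inputs:
        img_regions: (B, R, D) region-level image features (e.g. bottom-up attention)
        txt_words:   (B, W, D) word-level text features (e.g. from an RNN)
    Outputs:
        continuous codes in [-1, 1]; binarized with sign() at retrieval time.
    """

    def __init__(self, dim=2048, code_len=64):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # queries from text words
        self.key = nn.Linear(dim, dim)     # keys from image regions
        self.value = nn.Linear(dim, dim)   # values from image regions
        self.img_hash = nn.Linear(dim, code_len)
        self.txt_hash = nn.Linear(dim, code_len)

    def forward(self, img_regions, txt_words):
        # Word-to-region attention: each word attends over all image regions,
        # "translating" the text representation into the image feature space.
        q = self.query(txt_words)                           # (B, W, D)
        k = self.key(img_regions)                           # (B, R, D)
        v = self.value(img_regions)                         # (B, R, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, W, R)
        translated_txt = attn @ v                           # (B, W, D), text in image space

        # Pool regions/words into global representations before hashing.
        img_global = img_regions.mean(dim=1)                # (B, D)
        txt_global = translated_txt.mean(dim=1)             # (B, D)

        # Continuous relaxation of the binary hash codes.
        img_code = torch.tanh(self.img_hash(img_global))    # (B, code_len)
        txt_code = torch.tanh(self.txt_hash(txt_global))    # (B, code_len)
        return img_code, txt_code


if __name__ == "__main__":
    model = CrossModalAttentionHashing(dim=2048, code_len=64)
    img = torch.randn(4, 36, 2048)   # e.g. 36 detected regions per image
    txt = torch.randn(4, 20, 2048)   # e.g. 20 words, projected to the same dimension
    b_img, b_txt = model(img, txt)
    print(b_img.shape, b_txt.shape)  # torch.Size([4, 64]) torch.Size([4, 64])
    # At retrieval time the binary codes would be sign(b_img) and sign(b_txt).
```

In this reading, the "translation" step is what lets both modalities be hashed in a shared (image-side) space, which is one way the heterogeneity gap could be reduced; the actual DTAH loss functions and training procedure are not specified in the abstract.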