Abstract: We study and address the cross-modal retrieval problem, which lies at the heart of visual-textual processing. Its major challenge is how to effectively learn a shared multi-modal feature space in which the discrepancies between semantically related pairs, such as images and texts, are minimized regardless of their modalities. Most current methods focus on reasoning about cross-modality semantic relations within individual image-text pairs to learn the common representation. However, they overlook more global, structural inter-pair knowledge within the dataset, i.e., the graph-structured semantics within each training batch. In this paper, we introduce a graph-based, semantic-constrained learning framework that comprehensively explores intra- and inter-modality information for cross-modal retrieval. Our idea is to maximally exploit the structures of labeled data in a graph latent space and use them as semantic constraints that encourage feature embeddings of semantically matched (image-text) pairs to be more similar and those of mismatched pairs to be less similar. This gives rise to a novel graph-constrained common embedding learning paradigm for cross-modal retrieval, which has been largely under-explored until now. Moreover, a GAN-based dual learning approach is used to further improve discriminability and model the joint distribution across different modalities. Our fully-equipped approach, called Graph-constrained Cross-modal Retrieval (GCR), is able to mine intrinsic structures of the training data for model learning and enable reliable cross-modal retrieval. We empirically demonstrate that GCR achieves higher accuracy than existing state-of-the-art approaches on the Wikipedia, NUS-WIDE-10K, PKU XMedia and Pascal Sentence datasets. Code is available at https://github.com/neoscheung/GCR.
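Since the abstract only sketches the idea of using batch-level, label-derived graph structure as a constraint on the shared embedding space, the following is a minimal, hypothetical PyTorch sketch of such a loss. The function name, the margin-based form, and the choice of a binary label-affinity graph are assumptions for illustration; they are not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def graph_constrained_loss(img_emb, txt_emb, labels, margin=0.2):
    """Hypothetical batch-level graph-constrained embedding loss (sketch only).

    img_emb, txt_emb: (B, D) embeddings from the image and text branches.
    labels:           (B,) integer class labels used to build the semantic graph.
    The actual GCR objective may differ; this only illustrates using graph
    structure from labels as a constraint on the common space.
    """
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)

    # Semantic affinity graph over the batch: 1 if two samples share a label.
    affinity = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B)

    # Cross-modal cosine similarities between every image and every text.
    sim = img_emb @ txt_emb.t()  # (B, B)

    # Pull semantically related image-text pairs together,
    # push unrelated pairs below a margin.
    pos_loss = (affinity * (1.0 - sim)).sum() / affinity.sum().clamp(min=1.0)
    neg_mask = 1.0 - affinity
    neg_loss = (neg_mask * F.relu(sim - margin)).sum() / neg_mask.sum().clamp(min=1.0)
    return pos_loss + neg_loss
```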