Multi-Networks Joint Learning for Large-Scale Cross-Modal Retrieval

Liang Zhang, Bingpeng Ma, Guorong Li, Qingming Huang, Qi Tian

Published: 2017, Last Modified: 28 Feb 2026, ACM Multimedia 2017, CC BY-SA 4.0

Abstract: This paper proposes a novel deep framework of multi-networks joint learning for large-scale cross-modal retrieval. Most existing cross-modal methods do not address memory requirements during training and testing, so they are generally applied only to small-scale data. Moreover, they treat feature learning and latent space embedding as two separate steps, which cannot produce features tailored to the cross-modal task. To alleviate these problems, we first decompose the multiplication and inversion of the large matrices commonly involved in existing methods into operations on many sub-matrices. Each sub-matrix handles one image-sentence pair, for which we further design a novel sampling strategy that selects the most representative samples to construct the cross-modal ranking loss and the within-modal discriminant loss. In this way, the proposed model consumes less memory at each step and can therefore scale to large-scale data. Furthermore, we apply the proposed discriminative ranking loss to unify two heterogeneous networks, a deep residual network for images and a long short-term memory network for sentences, into an end-to-end deep learning architecture. As a result, we simultaneously learn features adapted to the cross-modal task and a shared latent space for images and sentences. Extensive evaluations on two large-scale cross-modal datasets show that the proposed method brings substantial improvements over other state-of-the-art ranking methods.
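
The abstract does not state the loss in closed form. As a rough illustration of the general idea of a cross-modal ranking loss built from selected samples within a mini-batch, the sketch below computes a bidirectional hinge ranking loss over paired image and sentence embeddings. The hardest-negative selection rule, the cosine similarity, the margin value, and all function names are assumptions made for illustration; they stand in for, and are not, the paper's sampling strategy. The within-modal discriminant loss is not covered here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_matrix(images, sentences):
    """sim[i][j] = cosine similarity between image i and sentence j."""
    imgs = [l2_normalize(v) for v in images]
    sents = [l2_normalize(v) for v in sentences]
    return [[sum(a * b for a, b in zip(i, s)) for s in sents] for i in imgs]


def hardest_negative_ranking_loss(images, sentences, margin=0.2):
    """Bidirectional hinge ranking loss over one mini-batch.

    Assumption: row k of `images` is the matching pair of row k of
    `sentences`; every other row in the batch is treated as a negative.
    Only the hardest (most similar) negative per direction is used,
    which is a hypothetical stand-in for a sample-selection step.
    """
    n = len(images)
    if n == 0:
        return 0.0
    sim = cosine_matrix(images, sentences)
    total = 0.0
    for k in range(n):
        positive = sim[k][k]
        # Image-to-sentence direction: hardest non-matching sentence.
        neg_sentences = [sim[k][j] for j in range(n) if j != k]
        # Sentence-to-image direction: hardest non-matching image.
        neg_images = [sim[j][k] for j in range(n) if j != k]
        if neg_sentences:
            total += max(0.0, margin - positive + max(neg_sentences))
        if neg_images:
            total += max(0.0, margin - positive + max(neg_images))
    return total / n


# Toy embeddings: already in a shared 2-D space for illustration.
aligned = hardest_negative_ranking_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = hardest_negative_ranking_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned, swapped)
```

Because the loss only touches the pairs inside the current mini-batch, memory use grows with the batch size rather than with the full dataset, which is the property the abstract emphasizes. In an end-to-end setting the embeddings would come from the image and sentence networks, and the loss would be minimized with an automatic-differentiation framework rather than evaluated on fixed lists as here.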