Abstract: Cross-modal retrieval (CMR), where a query from one modality is used to retrieve objects from a different modality, has gained a lot of attention due to its increasing importance. A plethora of techniques have been proposed for this task, with deep-learned multi-modal models being the dominant paradigm. While these techniques have become increasingly sophisticated at learning representations of multi-modal objects in a common space, relatively little attention has been paid to the overall computational costs of training the model and of retrieval. In this work, we present LCM (Lightweight framework for Cross-Modal retrieval), a surprisingly effective approach with very low computational costs. It can work with any available uni- and multi-modal representations, ranging from BoW/GIST to CLIP for the text/image modalities. In its training phase, LCM exploits the semantic labels with a combination of a shallow modality-specific feed-forward network and a label auto-encoder, such that embeddings in the common representation space that share labels are close to each other. During retrieval, LCM employs a novel 2-stage nearest neighbor (2Sknn) search that first ranks the candidate labels relevant to a query (stage-1), and then uses this ranking to retrieve results from the indexed collection (stage-2). Experiments over 6 popular uni- and multi-label supervised CMR benchmarks show that LCM outperforms some very recent strong baselines, with gains of up to 20% in mAP. Furthermore, we show that 2Sknn can benefit other baseline methods as well, offering mAP gains of up to 50% in some cases.
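The two-stage retrieval idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how stage-2 uses the label ranking, so everything here is an assumption made for illustration: the function name `two_stage_knn`, cosine similarity as the distance measure, the cut-offs `k_labels` and `k_items`, and the rule of ordering candidates by their best-ranked label and then by similarity to the query. It should not be read as the authors' 2Sknn implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_stage_knn(query, label_vecs, items, k_labels=2, k_items=3):
    """Hypothetical two-stage search; NOT the paper's exact procedure.

    query      : embedding of the query in the common space
    label_vecs : {label name: label embedding}
    items      : {item id: (item embedding, set of label names)}
    """
    # Stage 1: rank labels by similarity to the query and keep the top k_labels.
    ranked_labels = sorted(
        label_vecs,
        key=lambda name: cosine(query, label_vecs[name]),
        reverse=True,
    )[:k_labels]
    label_rank = {name: rank for rank, name in enumerate(ranked_labels)}

    # Stage 2 (assumed rule): keep only items carrying a top-ranked label,
    # order them by best label rank, then by similarity to the query.
    candidates = []
    for item_id, (vec, labels) in items.items():
        ranks = [label_rank[name] for name in labels if name in label_rank]
        if ranks:
            candidates.append((min(ranks), -cosine(query, vec), item_id))
    candidates.sort()
    return ranked_labels, [item_id for _, _, item_id in candidates[:k_items]]


# Toy data: 2-D embeddings standing in for the learned common space.
label_vecs = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "car": [-1.0, 0.0]}
items = {
    "img_a": ([0.9, 0.1], {"dog"}),
    "img_b": ([0.2, 0.8], {"cat"}),
    "img_c": ([-0.9, 0.1], {"car"}),
    "img_d": ([0.7, 0.3], {"dog", "cat"}),
}
top_labels, results = two_stage_knn([0.8, 0.2], label_vecs, items)
print(top_labels, results)
```

In this sketch, stage-1 compares the query against only the label embeddings, which are typically far fewer than the indexed items, and stage-2 scores only the items that carry a top-ranked label. This is one plausible way such a scheme could keep retrieval cost low.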