Abstract: Supervised cross-modal retrieval (CMR) achieves excellent performance owing to the semantic information provided by labels, which helps establish semantic correlations between samples from different modalities. However, real-world scenarios often contain large amounts of unlabeled and unpaired multimodal training data, rendering existing methods infeasible. To address this issue, we propose a novel partially aligned cross-modal retrieval method called Optimal Transport-based Prototype Alignment Learning (OTPAL). Because directly establishing matching correlations between unannotated, unaligned cross-modal samples is computationally expensive, we instead establish matching correlations between samples and shared prototypes. Specifically, we employ the optimal transport algorithm to obtain cross-modal alignment information between samples and prototypes, and then minimize the distance between samples and their corresponding prototypes through a specially designed prototype alignment loss. As an extension of this work, we further investigate the influence of incomplete multimodal data on cross-modal retrieval performance under the partially aligned setting described above. To address this more challenging scenario, we propose a scalable prototype-based neighbor feature completion method, which better captures the correlations between incomplete samples and their neighbors through a cross-modal self-attention mechanism. Experimental results on four benchmark datasets show that our method achieves satisfactory accuracy and scalability in various real-world scenarios.
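To make the alignment step concrete, below is a minimal PyTorch sketch of entropic optimal transport (Sinkhorn iterations) between a batch of samples and a shared prototype bank, followed by a prototype alignment loss that pulls each sample toward its assigned prototype. All names (sinkhorn, prototype_alignment_loss) and hyperparameters (eps, n_iters, the prototype count) are illustrative assumptions, not the paper's implementation.

import math
import torch

def sinkhorn(cost, eps=0.05, n_iters=50):
    # Entropic OT plan between B samples (rows) and K prototypes (columns),
    # with uniform marginals on both sides; computed in log space for stability.
    B, K = cost.shape
    log_P = -cost / eps
    for _ in range(n_iters):
        # Alternately project onto the row and column marginal constraints.
        log_P = log_P - (torch.logsumexp(log_P, dim=1, keepdim=True) + math.log(B))
        log_P = log_P - (torch.logsumexp(log_P, dim=0, keepdim=True) + math.log(K))
    return log_P.exp()

def prototype_alignment_loss(features, prototypes):
    # Squared-Euclidean transport cost between samples and prototypes.
    cost = torch.cdist(features, prototypes) ** 2
    with torch.no_grad():
        plan = sinkhorn(cost)          # (B, K) soft assignments, used as fixed targets
        assign = plan.argmax(dim=1)    # hard sample-to-prototype assignment
    # Minimize the distance between each sample and its assigned prototype.
    return ((features - prototypes[assign]) ** 2).sum(dim=1).mean()

# Usage: both encoders share one prototype bank, so minimizing the loss in
# each modality pulls unpaired samples toward common anchors.
img_feat = torch.randn(128, 256)                      # hypothetical image embeddings
txt_feat = torch.randn(128, 256)                      # hypothetical text embeddings
prototypes = torch.randn(32, 256, requires_grad=True)
loss = prototype_alignment_loss(img_feat, prototypes) + \
       prototype_alignment_loss(txt_feat, prototypes)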
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: 1. We identify two key challenges in enhancing the robustness of cross-modal retrieval systems in practical applications. Furthermore, we propose a prototype-based optimal transport learning strategy for partially aligned cross-modal retrieval.
2. We propose a novel framework for partially aligned cross-modal retrieval, which minimizes the distance between samples and their corresponding prototypes, thereby enhancing both the discriminability and the modality invariance of the learned features.
3. We design a scalable prototype-based cross-modal neighbor completion method to address the intractable scenario of incomplete, partially aligned data (a minimal sketch of the completion step follows this list).
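As a rough illustration of the completion step, the sketch below lets an observed-modality feature attend over the shared prototypes and its retrieved cross-modal neighbors to synthesize the missing modality's feature. The class name, tensor shapes, neighbor count, and the choice of a standard nn.MultiheadAttention layer are our assumptions rather than the paper's design.

import torch
import torch.nn as nn

class NeighborCompletion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, anchor, prototypes, neighbors):
        # anchor:     (B, D) feature from the observed modality
        # prototypes: (K, D) shared prototype bank
        # neighbors:  (B, N, D) retrieved cross-modal neighbor features
        B = anchor.size(0)
        context = torch.cat([prototypes.unsqueeze(0).expand(B, -1, -1),
                             neighbors], dim=1)          # (B, K+N, D)
        # Query with the anchor; attend over prototypes plus neighbors.
        out, _ = self.attn(anchor.unsqueeze(1), context, context)
        return out.squeeze(1)                            # completed (B, D) feature

# Usage: when the text side of a pair is missing, attend from the image
# feature over the prototypes and its nearest text neighbors.
model = NeighborCompletion(dim=256)
img_feat = torch.randn(8, 256)
prototypes = torch.randn(32, 256)
txt_neighbors = torch.randn(8, 5, 256)   # 5 nearest text features per sample
completed_txt = model(img_feat, prototypes, txt_neighbors)  # (8, 256)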
Submission Number: 4857