PC$^2$: Pseudo-Classification Based Pseudo-Captioning for Noisy Correspondence Learning in Cross-Modal Retrieval
Abstract: In the realm of cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework to address this challenge. PC$^2$ offers a threefold strategy. First, it establishes an auxiliary "pseudo-classification" task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Second, unlike prevailing margin-based techniques, it capitalizes on this pseudo-classification capability to generate pseudo-captions that provide more informative and tangible supervision for each mismatched pair. Third, the oscillation of pseudo-classification is used to assist the correction of correspondence. In addition to these technical contributions, we develop a realistic NCL dataset called Noise of Web (NoW), which can serve as a powerful new NCL benchmark where noise exists naturally. Empirical evaluations of PC$^2$ show marked improvements over existing state-of-the-art robust cross-modal retrieval techniques on both simulated and realistic datasets under various NCL settings. The contributed dataset and source code are released at https://github.com/alipay/PC2-NoiseofWeb.
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work introduces the Pseudo-Classification based Pseudo-Captioning (PC$^2$) framework, which tackles significant challenges posed by noisy correspondence learning (NCL) in cross-modal retrieval. It also introduces a realistic dataset, Noise of Web (NoW), which provides a comprehensive new benchmark for evaluating NCL and advances research in multimodal learning. The innovation of PC$^2$ lies primarily in three areas. First, PC$^2$ establishes an auxiliary "pseudo-classification" task that interprets captions as categorical labels, steering the model towards learning image-text semantic similarity via a non-contrastive mechanism. This represents a novel approach to integrating diverse modalities within multimedia. Second, by leveraging pseudo-classification, PC$^2$ generates pseudo-captions that provide informative and tangible supervision for mismatched pairs, diverging from traditional margin-based methods. This offers a new perspective on multimodal data processing. Third, it uses the oscillation of pseudo-classification to assist in correcting correspondence, enhancing robustness against noisy correspondence. The tangible improvements over existing techniques and the collection of a new realistic dataset mark a substantial contribution to the field of multimedia and multimodal processing.
Supplementary Material: zip
Submission Number: 648