Keywords: DiffNCL, Cross-Modal Retrieval, Noisy Correspondence Learning, Forward–Reverse Diffusion
TL;DR: Our work pioneers the integration of diffusion dynamics into noisy correspondence learning by proposing DiffNCL.
Abstract: Current noisy correspondence learning (NCL) pipelines typically treat correspondence quality as a binary variable, neglecting the abundant category of weakly noisy correspondences. This binary treatment introduces two persistent issues: (i) over-exclusion, where partially informative pairs are discarded as negatives, shrinking the effective data manifold, and (ii) under-alignment, where residual noise from weakly mismatched pairs propagates through gradient updates, degrading representation fidelity. To address these challenges, this work pioneers a unified forward–reverse diffusion framework, DiffNCL, which explicitly amplifies and subsequently purifies weakly noisy correspondences for robust noisy correspondence learning. In the forward diffusion, synchronized stochastic perturbations inject Gaussian noise into paired visual–textual embeddings, and step-wise similarities are aggregated to highlight the diffusion discrepancy of weakly noisy mismatches. During the reverse diffusion, two complementary consistency objectives, i.e., intra-modal structural consistency and cross-modal semantic consistency, progressively purify and reconstruct weakly noisy correspondences into high-quality pairs for subsequent training cycles. Extensive experiments on benchmark datasets, including Flickr30K, MS-COCO, and Conceptual Captions, demonstrate the superiority of DiffNCL over state-of-the-art baselines for cross-modal retrieval under noisy correspondences.
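To make the forward-diffusion scoring idea concrete, the following is a minimal sketch (not the authors' released code) of how synchronized Gaussian perturbations and step-wise similarity aggregation could be implemented. The variance schedule, number of steps `T`, mean aggregation, and the interpretation of "synchronized" as sharing the timestep schedule across modalities are all illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of forward-diffusion discrepancy scoring, assuming
# L2-normalized paired embeddings and a standard DDPM-style schedule.
import torch
import torch.nn.functional as F


def diffusion_discrepancy_score(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                T: int = 50) -> torch.Tensor:
    """Aggregate step-wise similarities per pair (higher = cleaner pair).

    img_emb, txt_emb: (batch, dim) embeddings of paired samples.
    """
    # Linear variance schedule (assumed); alpha_bar_t is the cumulative product.
    betas = torch.linspace(1e-4, 2e-2, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    sims = []
    for t in range(T):
        a = alpha_bars[t].sqrt()
        b = (1.0 - alpha_bars[t]).sqrt()
        # Synchronized perturbation: both modalities are noised at the same
        # timestep with the same schedule (independent noise samples here).
        img_t = a * img_emb + b * torch.randn_like(img_emb)
        txt_t = a * txt_emb + b * torch.randn_like(txt_emb)
        sims.append(F.cosine_similarity(img_t, txt_t, dim=-1))

    # Weakly noisy pairs lose similarity faster under perturbation, so their
    # aggregated score is lower, exposing the "diffusion discrepancy".
    return torch.stack(sims, dim=0).mean(dim=0)


if __name__ == "__main__":
    v = F.normalize(torch.randn(8, 256), dim=-1)
    c = F.normalize(torch.randn(8, 256), dim=-1)
    print(diffusion_discrepancy_score(v, c))
```

Pairs whose aggregated score falls below a chosen threshold would then be routed to the reverse-diffusion purification stage rather than discarded outright; the thresholding rule above is likewise only an assumed usage pattern.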
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16025