Vision–Language Pretraining with Structured Distractor Augmentation

Published: 25 Mar 2026, Last Modified: 28 May 2026
Venue: CVPR 2026 Workshop CogVL (Poster)
License: CC BY 4.0
Track: Track 1: Papers with IEEE/CVF Workshop Proceedings
Keywords: Vision–Language Pretraining, Distractor Augmentation, Cross-Modal Grounding, Multimodal Learning, Curriculum Learning, Contrastive Learning, Image–Text Retrieval, VQA
TL;DR: VISDA improves vision–language pretraining by adding structured visual, textual, and relational distractors in a curriculum, enabling models to resolve subtle cross-modal ambiguities and boosting downstream VLP tasks.
Abstract: We propose \textbf{VISDA} (\textbf{Vis}ion--Language Pretraining with \textbf{D}istractor \textbf{A}ugmentation), a vision--language pretraining method that enhances cross-modal grounding by introducing structured, semantically plausible but incorrect training examples. Unlike prior work that relies on random in-batch negatives or token masking alone, VISDA constructs three complementary distractor families (\emph{visual}, \emph{textual}, and \emph{relational}) and organizes them in a curriculum of increasing difficulty. A dedicated distractor classification objective, combined with standard contrastive and image--text matching losses, forces the model to resolve subtle cross-modal ambiguities that random negatives do not expose. In our experiments, 3-epoch fine-tuning from a CLIP/BERT initialization yields 53.9\% accuracy on ARO compositional reasoning, 24.9\% on VQA~v2, and 50.0\% on NLVR2. Ablation and difficulty-schedule analyses show a detectable benefit from the curriculum even during fine-tuning: the curriculum schedule achieves 25.21\% VQA accuracy vs.\ 24.88\% for fixed-difficulty schedules.
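The training recipe described in the abstract (a difficulty curriculum plus a distractor classification term added to the contrastive and image--text matching losses) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear ramp, the four-way label set (matched pair plus one class per distractor family), the loss weights, and all function names are hypothetical.

```python
import math

# Assumed label set: a matched pair plus one class per distractor family.
CLASSES = ("matched", "visual", "textual", "relational")


def curriculum_difficulty(step, total_steps, start=0.0, end=1.0):
    """Linearly ramp distractor difficulty from `start` to `end`.

    The linear shape is an assumption; the abstract only says that
    difficulty increases over the curriculum.
    """
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def combined_loss(l_contrastive, l_itm, l_distractor, w_itm=1.0, w_dist=1.0):
    """Weighted sum of the three objectives (weights are hypothetical)."""
    return l_contrastive + w_itm * l_itm + w_dist * l_distractor


if __name__ == "__main__":
    # Difficulty at the start, middle, and end of a 10-step schedule.
    for step in (0, 5, 9):
        print(step, round(curriculum_difficulty(step, 10), 3))

    # Distractor classification loss for an example labeled "relational".
    logits = [0.2, -0.1, 0.4, 1.3]
    l_dist = cross_entropy(logits, CLASSES.index("relational"))
    print(round(combined_loss(0.9, 0.6, l_dist), 4))
```

In a real training loop the difficulty value would control which distractors are sampled at each step (for example, how close a distractor is to the true caption or image), and the three loss terms would come from the model's own heads rather than from fixed numbers.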
Submission Number: 22