Track: Track 1: Papers with IEEE/CVF Workshop Proceedings
Keywords: Vision–Language Pretraining, Distractor Augmentation, Cross-Modal Grounding, Multimodal Learning, Curriculum Learning, Contrastive Learning, Image–Text Retrieval, VQA
TL;DR: VISDA improves vision–language pretraining by adding structured visual, textual, and relational distractors in a curriculum, enabling models to resolve subtle cross-modal ambiguities and boosting downstream VLP tasks.
Abstract: We propose \textbf{VISDA} (\textbf{Vis}ion--Language Pretraining with \textbf{D}istractor \textbf{A}ugmentation), a vision--language pretraining method that enhances cross-modal grounding by introducing structured, semantically plausible but incorrect training examples.
Unlike prior work relying on random in-batch negatives or token masking alone, VISDA constructs three complementary distractor families---\emph{visual}, \emph{textual}, and \emph{relational}---and organizes them in a curriculum of increasing difficulty.
A dedicated distractor classification objective, combined with standard contrastive and image--text matching losses, forces the model to resolve subtle cross-modal ambiguities that random negatives do not expose.
In our experiments, 3-epoch fine-tuning from a CLIP/BERT initialization yields 53.9\% accuracy on ARO compositional reasoning, 24.9\% on VQA~v2, and 50.0\% on NLVR2.
Ablation and difficulty-schedule analyses show detectable benefits from the curriculum even during fine-tuning: the curriculum schedule achieves 25.21\% VQA accuracy vs.\ 24.88\% for fixed-difficulty schedules.
Submission Number: 22