Keywords: VQA-Robustness, Contrastive Loss, Adversarial perturbations
TL;DR: AdvCL improves Visual Question Answering (VQA) robustness through adversarial contrastive learning with hard negatives, enhancing consistency across linguistic variations and visual inputs.
Abstract: This paper addresses the challenge of enhancing the robustness and efficiency of Visual Question Answering (VQA) models by leveraging feature consistency. Inspired by semi-supervised feature representation learning, we introduce a contrastive loss framework to effectively capture representations from multi-modal inputs. However, existing contrastive learning approaches, which use random intra-class samples as positives and non-target samples as negatives, often fail to improve performance on robust VQA benchmarks. To overcome this limitation, we propose Adversarial Contrastive Learning (AdvCL), a supervised framework that generates challenging positive and negative samples via adversarial perturbations. AdvCL creates hard positives by applying significant perturbations to input image-question pairs, thereby maximizing conditional likelihood and enhancing robustness. Experimental results demonstrate that AdvCL outperforms or matches state-of-the-art models in robustness to linguistic variations in questions, offering a significant advancement in VQA robustness.
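The core idea described above — generating hard positives by adversarially perturbing inputs so as to increase the contrastive loss — can be illustrated with a minimal sketch. This is not the paper's implementation: for simplicity it perturbs an embedding vector directly (rather than an image-question pair), uses a numerical gradient, and all function names and hyperparameters (`info_nce`, `eps`, `steps`, `lr`) are illustrative assumptions.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE contrastive loss over cosine similarities."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def adversarial_positive(anchor, positive, negatives, eps=0.5, steps=10, lr=0.1):
    """Illustrative hard-positive generation: ascend the contrastive loss
    w.r.t. the positive embedding, projected onto an L2 ball of radius eps.
    (The actual AdvCL method perturbs the input image-question pair.)"""
    pos = positive.copy()
    h = 1e-5
    for _ in range(steps):
        # Numerical gradient of the loss w.r.t. the positive embedding.
        grad = np.zeros_like(pos)
        for i in range(pos.size):
            d = np.zeros_like(pos)
            d[i] = h
            grad[i] = (info_nce(anchor, pos + d, negatives)
                       - info_nce(anchor, pos - d, negatives)) / (2 * h)
        pos = pos + lr * grad  # gradient *ascent*: make the positive harder
        delta = pos - positive
        norm = np.linalg.norm(delta)
        if norm > eps:  # project back into the perturbation budget
            pos = positive + delta * (eps / norm)
    return pos

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.1 * rng.normal(size=8)   # easy positive near the anchor
negatives = [rng.normal(size=8) for _ in range(4)]
hard = adversarial_positive(anchor, positive, negatives)
# The adversarial positive should be harder, i.e. yield a larger loss.
print(info_nce(anchor, hard, negatives) > info_nce(anchor, positive, negatives))
```

In a full pipeline the perturbation would be applied to the raw inputs and the gradient taken through the VQA encoder, but the ascend-then-project loop is the same.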
Submission Number: 24