MANGO: Enhancing the Robustness of VQA Models via Adversarial Noise Generation

Anonymous

16 Jan 2022 (modified: 05 May 2023) · ACL ARR 2022 January Blind Submission
Abstract: Large-scale pre-trained vision-and-language (V+L) transformers have propelled the state of the art (SOTA) on the Visual Question Answering (VQA) task. Despite impressive performance on the standard VQA benchmark, it remains unclear how robust these models are. To investigate, we conduct a host of evaluations over 4 different types of robust VQA datasets: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Experiments show that, with standard finetuning alone, pre-trained V+L models already exhibit better robustness than many task-specific SOTA methods. To further enhance model robustness, we propose MANGO, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool V+L models. Unlike previous studies that focus on one specific type of robustness, MANGO is agnostic to the robustness type and delivers consistent performance gains for both task-specific and pre-trained models across diverse robust VQA datasets designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that MANGO achieves new SOTA on 7 out of 9 robustness benchmarks.
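The abstract does not detail MANGO's architecture or training objective, so the sketch below is only a plausible illustration of the general idea it names: a learned generator that adds bounded adversarial noise to text and image embeddings, trained in a min-max game against the VQA model. All names (DummyVQAModel, NoiseGenerator, train_step), dimensions, loss terms, and hyperparameters are hypothetical placeholders, not taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: every module, shape, and hyperparameter here is
# an assumption; it is NOT the paper's actual MANGO implementation.

class DummyVQAModel(nn.Module):
    """Stand-in for a pre-trained V+L transformer: fuses text and image
    embeddings and classifies over a fixed answer vocabulary."""
    def __init__(self, dim: int = 64, num_answers: int = 10):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.head = nn.Linear(dim, num_answers)

    def forward(self, txt_emb, img_emb):
        fused = torch.relu(self.fuse(torch.cat([txt_emb, img_emb], dim=-1)))
        return self.head(fused)

class NoiseGenerator(nn.Module):
    """Learned generator producing additive noise in embedding space,
    projected onto an L2 ball of radius eps so perturbations stay small."""
    def __init__(self, dim: int = 64, eps: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.eps = eps

    def forward(self, emb):
        noise = self.net(emb)
        norm = noise.norm(dim=-1, keepdim=True).clamp_min(1e-12)
        return noise * torch.clamp(self.eps / norm, max=1.0)

def train_step(model, gen, opt_model, opt_gen, txt, img, labels):
    ce = nn.CrossEntropyLoss()
    # (1) Generator step: ascend on the task loss to craft harder noise.
    opt_gen.zero_grad()
    adv_logits = model(txt + gen(txt), img + gen(img))
    (-ce(adv_logits, labels)).backward()
    opt_gen.step()
    # (2) Model step: descend on clean + adversarial loss for robustness.
    #     Noise is detached so this step does not update the generator.
    opt_model.zero_grad()
    adv_logits = model(txt + gen(txt).detach(), img + gen(img).detach())
    loss = ce(model(txt, img), labels) + ce(adv_logits, labels)
    loss.backward()
    opt_model.step()
    return loss.item()

# Toy usage on random tensors.
model, gen = DummyVQAModel(), NoiseGenerator()
opt_m = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
txt, img = torch.randn(8, 64), torch.randn(8, 64)
labels = torch.randint(0, 10, (8,))
print(train_step(model, gen, opt_m, opt_g, txt, img, labels))
```

The norm projection inside the generator is one common way to keep embedding-space perturbations bounded; whether MANGO constrains its noise this way is not stated in the abstract.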
Paper Type: long