GroundAttack: Mitigating Easy-Options Bias for Visual Question Answering

20 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Visual Question Answering, Shortcut learning, Vision Language Models
TL;DR: We reveal and address an Easy-Options Bias in VQA benchmarks by introducing GroundAttack, which generates hard negative options to enable fairer evaluation of vision-language models.
Abstract: In this early study, we observe an Easy-Options Bias (EOB) issue in several multiple-choice Visual Question Answering (VQA) benchmarks, including MMStar, RealWorldQA, SEED-Bench, NeXT-QA, STAR, and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision ($\boldsymbol{V}$) and options ($\boldsymbol{O}$) as inputs, without the question ($\boldsymbol{Q}$). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual content than the negative options do in feature space, creating a shortcut that lets VLMs infer the answer via simple vision-option similarity matching. To mitigate this, we introduce GroundAttack, an agentic method that automatically generates hard negative options that are as visually plausible as the correct answer. We apply it to the NeXT-QA and MMStar datasets, creating new EOB-free annotations. On these annotations, current VLMs fall to near-random accuracy under the ($\boldsymbol{V}$+$\boldsymbol{O}$) setting and to non-saturated accuracy under the ($\boldsymbol{V}$+$\boldsymbol{Q}$+$\boldsymbol{O}$) setting, providing a more realistic evaluation of VLMs' QA ability.
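The vision-option shortcut described above can be illustrated with a minimal sketch. The embeddings, dimensions, and noise level below are all hypothetical stand-ins (the paper's actual experiments use real VLM features); the point is only that when the correct option sits closer to the image in feature space than the distractors, an argmax over cosine similarities recovers the answer without ever reading the question.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors illustrating the Easy-Options Bias:
# the correct option is a lightly perturbed copy of the image embedding,
# while the negatives are unrelated random vectors.
rng = np.random.default_rng(0)
image = rng.normal(size=8)
correct = image + 0.1 * rng.normal(size=8)          # visually aligned
negatives = [rng.normal(size=8) for _ in range(3)]  # weakly aligned

options = [correct] + negatives
scores = [cosine(image, o) for o in options]

# The question Q is never consulted, yet the highest-similarity
# option is the correct one (index 0): the V+O shortcut succeeds.
pred = int(np.argmax(scores))
print(pred)
```

GroundAttack's hard negatives are designed to break exactly this gap: once every option is comparably similar to the image, the argmax above degrades to chance and the model must actually use $\boldsymbol{Q}$.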
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 22840