Keywords: Visual Question Answering, Ambiguity, Question Generation, Clarification
Abstract: Visual Question Answering (VQA) can suffer from under-specification, where the same image-question pair may have multiple plausible answers depending on missing external context. Existing research highlights this limitation but does not provide methods for teaching models to proactively seek context. In this work, we study the task of open-ended clarification question generation for underspecified VQA. We curate a dataset of ambiguous VQA pairs annotated with human-verified clarification questions that capture cultural, temporal, spatial, or attribute-based uncertainty. To address this task, we develop a reinforcement learning framework, Grounded Reasoning Preference Optimization--Clarification Reasoning (GRPO-CR), which integrates tailored reward functions to ensure that generated clarifications are effective at resolving ambiguity. Experimental results show that GRPO-CR enables VLMs to ask clarification questions that more reliably reduce uncertainty. Our work establishes open-ended, context-seeking clarification as a principled pathway toward interactive, trustworthy multimodal systems that know when and what to ask before answering.
Submission Number: 176