GQA-Q2Q: A Large-scale Dataset for Resolving Entity Ambiguity in Visual Question-Answering via Clarifying Subquestion

ICLR 2026 Conference Submission 22647 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Visual Question-Answering, Entity Ambiguity, Question Clarification, Subquestion Generation, Benchmark Dataset
Abstract: Vision-Language Models (VLMs) have achieved remarkable results on various visual question-answering (VQA) benchmarks. However, their performance degrades significantly on ambiguous questions in which the target entity in the image is not clearly identified. To address and evaluate this issue, a dedicated benchmark dataset is needed that pairs each ambiguous question with a clarifying subquestion. Constructing such a large, high-quality benchmark is costly, particularly when it relies on expert annotations. To construct such a dataset efficiently at scale, this paper presents a hybrid human-machine pipeline. The pipeline begins by generating a small initial set of subquestions using rule-based templates, which are then refined through human annotation. This annotated set serves as the foundation for training a subquestion generator and a validator, which together enable automated construction of a large-scale dataset. The result is a new large-scale dataset, GQA-Q2Q, designed to disambiguate unclear entities in questions by providing clarifying subquestions. Furthermore, a VQA framework is introduced that uses the clarifying subquestions to resolve ambiguity before producing a final answer. Experimental results demonstrate that this approach significantly enhances VQA performance, validating the effectiveness of the proposed dataset.
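The automated stage of the pipeline described in the abstract, i.e. a subquestion generator whose outputs are filtered by a validator, might be sketched as follows. This is a minimal illustrative toy, not the paper's actual models: the template, the validator rule, and all function names are assumptions.

```python
# Illustrative sketch of a generator-plus-validator stage: a rule-based
# template produces a clarifying subquestion for an ambiguous entity, and a
# toy validator decides whether to keep the (question, subquestion) pair.
# Everything here is hypothetical; the paper trains learned models instead.

def generate_subquestion(entity: str) -> str:
    """Rule-based template: ask which instance of the entity is meant."""
    return f"Which {entity} are you referring to?"

def validate(question: str, entity: str, subquestion: str) -> bool:
    """Toy validator: accept only if the subquestion targets an entity
    that actually appears in the original question."""
    return entity in question.lower() and entity in subquestion.lower()

def build_pair(question: str, entity: str):
    """Return an accepted (question, subquestion) pair, or None if rejected."""
    sub = generate_subquestion(entity)
    return (question, sub) if validate(question, entity, sub) else None

# Example: an accepted pair and a rejected one.
accepted = build_pair("What color is the dog?", "dog")
rejected = build_pair("What color is the cat?", "dog")
```

In the paper's pipeline, the human-refined seed set would replace the hand-written template, and the validator would be a trained model rather than a string check; the control flow (generate, then filter) is the part this sketch illustrates.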
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 22647