Keywords: language models, question asking, constraint satisfaction problems, underspecification
Abstract: Large language models (LLMs) have shown impressive performance on reasoning benchmarks like math and logic. While prior work has largely assumed well-specified tasks, real-world queries are often underspecified and solvable only by acquiring missing information. We formalize this information-gathering problem as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case in which exactly one necessary variable assignment is missing, we can evaluate an LLM's ability to identify the minimal necessary question to ask. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with partially observed initial states, (3) GSM-Q: human-annotated grade school math problems with one unknown variable, and (4) GSME-Q: an equation-based version of GSM-Q. The LLM must select the correct clarification question from multiple options. While current models excel at GSM-Q and GSME-Q, they achieve only 40–50% accuracy on Logic-Q and Planning-Q. Analysis shows that the ability to solve well-specified reasoning problems is not sufficient for success on our benchmark: models struggle to identify the right question even when they can solve the fully specified version of the problem. This highlights the need to specifically optimize models' information-acquisition capabilities.
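To make the CSP formalization concrete, here is a minimal sketch of the 1-sufficient special case: given known variables, candidate unknowns, and constraints, find the variable whose value alone would make the goal derivable. The helper functions, variable names, and toy instance below are hypothetical illustrations, not code or data from QuestBench.

```python
# Hypothetical sketch of the "1-sufficient" CSP setting: exactly one missing
# variable assignment blocks derivation of the goal, and the task is to
# identify which variable to ask about.

def derivable(goal, known, constraints):
    """Return True if `goal` can be derived from `known` variables by
    repeatedly applying constraints of the form (output, dependencies)."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for output, deps in constraints:
            if output not in known and deps <= known:
                known.add(output)
                changed = True
    return goal in known

def minimal_questions(goal, known, unknowns, constraints):
    """Variables whose value alone would make the goal derivable."""
    return [v for v in unknowns
            if derivable(goal, set(known) | {v}, constraints)]

# Toy GSME-Q-style instance: total = apples + oranges; num_pears is a
# distractor that does not affect the goal.
constraints = [("total", {"apples", "oranges"})]
known = {"apples"}
unknowns = {"oranges", "num_pears"}
print(minimal_questions("total", known, unknowns, constraints))
# -> ['oranges']  (the one question worth asking)
```

A multiple-choice instance, as used in the benchmark, then amounts to presenting each candidate variable as a clarification question and checking whether the model selects the sufficient one.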
Croissant File: json
Dataset URL: https://huggingface.co/datasets/belindazli/QuestBench
Code URL: https://github.com/google-deepmind/questbench
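For convenience, a hedged sketch of loading the dataset with the Hugging Face `datasets` library; this assumes the repository's files can be auto-loaded, and any required config or split names may differ from what is shown.

```python
from datasets import load_dataset

# Assumes the Hugging Face repo auto-loads; a specific config name may be
# required depending on how the QuestBench files are organized.
ds = load_dataset("belindazli/QuestBench")
print(ds)  # inspect the available splits and columns before use
```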
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 757