Evaluating Information Gathering Abilities of Large Language Models with QuestBench

ICLR 2025 Conference Submission 12191 Authors

27 Sept 2024 (modified: 26 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: information gathering, question asking, language model, evaluation, benchmarks
Abstract: Large language models (LLMs) have mastered a wide range of reasoning tasks, with the underlying assumption that the tasks are well-specified enough for LLMs to reach solutions. In reality, queries and instructions to LLMs often contain incomplete or underspecified information. Therefore, LLMs need to be able to actively acquire missing information by asking clarifying questions, ideally seeking the minimally sufficient piece of information. To assess whether LLMs possess this ability, we construct QuestBench, a set of underspecified reasoning tasks that can be solved by asking at most a single question. We frame the tasks as constraint satisfaction problems with missing variable assignments, where the exact model response cannot be determined unless certain variables' values are acquired. This framework specifically targets tasks where uncertainty stems from missing information, rather than semantic ambiguity in language. QuestBench includes (1) Logic-Q: logical reasoning tasks where one proposition is missing, (2) Planning-Q: PDDL planning problems where the initial state is partially observed, and (3) GSM-Q: grade school math problems where one variable assignment is missing. Each task presents multiple choices of possible questions, only one of which is correct. We evaluate Gemini and GPT-4o models and find that they achieve 20–30% accuracy in both zero-shot and few-shot settings. When evaluating GPT-4-o1 on a subset of our data, we find that it is only 41–44% accurate, despite using state-of-the-art inference-time reasoning techniques. When investigating characteristics of QuestBench, we find that LLMs struggle with tasks that are computationally expensive for traditional search-based CSP solvers. Our analyses reveal a negative correlation between LLM accuracy and solver runtime complexity, suggesting that LLMs may share similar limitations to CSP solvers.
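The CSP framing described in the abstract can be illustrated with a small sketch. The Python snippet below is an illustrative assumption, not the benchmark's released code: it builds a toy constraint satisfaction problem in which the target variable is underdetermined by the known facts, then brute-forces which single unknown variable would, once revealed, make the target uniquely determined; this mirrors the task of picking the one correct clarifying question. The variable names, constraints, and helper functions are all hypothetical.

```python
# Minimal sketch (not the authors' code): a toy version of the
# "ask the one missing variable" setup, framed as a constraint
# satisfaction problem over booleans.
from itertools import product

VARS = ["a", "b", "c", "t"]
TARGET = "t"

def consistent(assign):
    # Toy constraints: t = a AND b, and c = NOT a.
    return (assign["t"] == (assign["a"] and assign["b"])
            and assign["c"] == (not assign["a"]))

def completions(known):
    """All full assignments that agree with `known` and satisfy the constraints."""
    free = [v for v in VARS if v not in known]
    for combo in product([True, False], repeat=len(free)):
        assign = dict(known, **dict(zip(free, combo)))
        if consistent(assign):
            yield assign

def target_values(known):
    return {a[TARGET] for a in completions(known)}

known = {"a": True}                      # partially specified problem
assert len(target_values(known)) > 1     # target is underdetermined

# Which single unknown variable, once its value is revealed, pins down the target?
for var in ["b", "c"]:
    possible = {a[var] for a in completions(known)}
    sufficient = all(len(target_values({**known, var: v})) == 1 for v in possible)
    print(f"ask about {var}?", "yes" if sufficient else "no")
# Expected: asking about b suffices (t = a AND b with a known); asking about c does not.
```

In this toy instance, the correct clarifying question is the value of `b`, since `t = a AND b` and `a` is already known, whereas `c` is fully determined by `a` and reveals nothing new about the target.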
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12191