Keywords: information-seeking dialogue, multi-turn evaluation, benchmark datasets, large language models, dialogue systems
TL;DR: A benchmark for evaluating how LLMs handle ambiguous, open-ended requests through dialogue, revealing that current models fail to exhibit effective information-seeking behavior.
Abstract: While large language models excel at following explicit instructions, they often fail to handle ambiguous requests, defaulting to generic responses rather than seeking clarification. To assess this capability, we present InfoQuest, a multi-turn chat benchmark that evaluates how models uncover hidden context through sequential interactions. The benchmark uses ambiguous scenarios that require models to ask clarifying questions before providing a response. Analogous to reinforcement learning's sequential optimization, we measure success with conversational reward signals derived from user satisfaction and information discovery. Our evaluation shows that all models struggle with information-seeking: proprietary models perform better but still require excessive turns and frequently revert to generic responses. We provide a methodology for generating scenarios and evaluating capabilities through reward-free interactions and implicit feedback signals, revealing fundamental limitations in handling ambiguity and highlighting the need for training approaches that optimize long-term dialogue outcomes beyond traditional reward maximization.
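To make the evaluation setup concrete, the sketch below illustrates one way a multi-turn loop like the one described in the abstract could be scored: a model converses with a simulated user who holds hidden context, and a reward combines information discovery with a turn-efficiency proxy for user satisfaction. This is not the authors' released code; all function names, signatures, and scoring weights are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the InfoQuest implementation) of scoring a
# multi-turn dialogue against a scenario with hidden context.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scenario:
    request: str                 # ambiguous opening request shown to the model
    hidden_facts: List[str]      # context the model must uncover via questions


@dataclass
class DialogueResult:
    transcript: List[str] = field(default_factory=list)
    discovered: int = 0
    turns: int = 0


def evaluate_dialogue(
    scenario: Scenario,
    model_respond: Callable[[List[str]], str],   # assumed LLM wrapper: history -> reply
    user_reply: Callable[[str, Scenario], str],  # assumed user simulator: reply -> answer
    max_turns: int = 8,
) -> float:
    """Return a reward in [0, 1] combining discovery and turn efficiency."""
    result = DialogueResult(transcript=[scenario.request])
    for _ in range(max_turns):
        reply = model_respond(result.transcript)
        result.transcript.append(reply)
        result.turns += 1
        # Count hidden facts surfaced anywhere in the conversation so far.
        joined = " ".join(result.transcript).lower()
        result.discovered = sum(fact.lower() in joined for fact in scenario.hidden_facts)
        if result.discovered == len(scenario.hidden_facts):
            break
        result.transcript.append(user_reply(reply, scenario))

    discovery = result.discovered / max(len(scenario.hidden_facts), 1)
    efficiency = 1.0 - (result.turns - 1) / max_turns  # satisfaction proxy: fewer turns is better
    return 0.7 * discovery + 0.3 * efficiency          # weighting is an assumption
```

In this sketch, models that provide a generic answer without asking clarifying questions score poorly on discovery, while models that ask excessive questions are penalized through the efficiency term, mirroring the failure modes the abstract reports.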
Submission Number: 20