RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions

Published: 01 Jan 2024 · Last Modified: 25 Sept 2025 · CoRR 2024 · CC BY-SA 4.0
Abstract: Large Language Models (LLMs) are widely used in conversational AI systems to generate responses to user inquiries. However, many natural questions lack well-defined answers. While existing studies primarily focus on question types such as those with false premises, they often overlook out-of-scope questions, where the provided document is highly semantically similar to the query but does not contain the required answer. In this paper, we propose a guided hallucination-based method to efficiently generate a diverse set of out-of-scope questions from a given document corpus. We then evaluate multiple LLMs on their effectiveness at detecting such confusing questions and generating appropriate responses. Furthermore, we introduce an improved method for detecting out-of-scope questions, enhancing the reliability of LLM-based question-answering systems.
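
The abstract does not spell out the prompts or pipeline, so the following is only a minimal sketch of how guided hallucination-based generation of out-of-scope questions and an LLM-based confusion check might be wired up. The helper `call_llm`, the prompt wording, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: generate questions that sound answerable from a
# document but are not, then check whether the document actually answers them.
# `call_llm` is a hypothetical stand-in for any chat-completion API.

from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (wire this to your provider of choice)."""
    raise NotImplementedError


def generate_out_of_scope_questions(document: str, n: int = 5) -> List[str]:
    """Guided hallucination: ask the model for questions that are topically
    close to the document but whose answers it does not contain."""
    prompt = (
        f"Read the document below. Write {n} questions that are closely "
        "related to its topic but CANNOT be answered from the document "
        "itself, one per line.\n\n"
        f"Document:\n{document}"
    )
    return [q.strip() for q in call_llm(prompt).splitlines() if q.strip()]


def is_out_of_scope(document: str, question: str) -> bool:
    """Confusion-detection check: does the document answer the question?"""
    prompt = (
        "Does the document below contain the information needed to answer "
        "the question? Reply with exactly YES or NO.\n\n"
        f"Document:\n{document}\n\nQuestion: {question}"
    )
    return call_llm(prompt).strip().upper().startswith("NO")
```

In a benchmark-construction setting of this kind, generated questions that the detector judges answerable would presumably be filtered out, so that only genuinely out-of-scope items remain for evaluation.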