SocraticEval: A Benchmark for Evaluating the Socratic Questioning Ability of LLMs in Dialogue Interaction
Keywords: Socratic Questioning, Large Language Model (LLM) Evaluation, Benchmark Construction
Abstract: Socratic questioning is vital for fostering critical reasoning in domains such as education. However, there is currently no effective framework for assessing this capability in Large Language Models (LLMs). To bridge this gap, we propose \textsc{SocraticEval}, a benchmark that systematically decomposes the capability into \textit{Question Generation} and \textit{Strategy Utilization}. Leveraging our multi-domain dataset, \textsc{SocraticEnv}, we reveal a critical gap between theory and practice: state-of-the-art models exhibit limited strategic diversity, frequently devolving into mere rebuttal rather than constructive guidance, which undermines the Socratic method's intended value. They also show a pronounced deficiency in interrogating logical fallacies. To address this deficiency, we construct \textsc{SocraticPref}, a human preference dataset with ranked candidate questions, and apply Direct Preference Optimization (DPO), yielding consistent improvements in fallacy-focused questioning.
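For reference, the DPO objective used here is the standard one; a minimal statement follows, assuming each \textsc{SocraticPref} instance pairs a preferred question $q^{+}$ with a dispreferred question $q^{-}$ for a dialogue context $x$ (this notation follows the original DPO formulation rather than this submission):
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,q^{+},\,q^{-})}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(q^{+}\mid x)}{\pi_{\mathrm{ref}}(q^{+}\mid x)} - \beta\log\frac{\pi_\theta(q^{-}\mid x)}{\pi_{\mathrm{ref}}(q^{-}\mid x)}\right)\right]$$
where $\pi_\theta$ is the policy being fine-tuned, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the sigmoid, and $\beta$ is a temperature-like scaling hyperparameter.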
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: evaluation and metrics, task-oriented
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis, Position papers
Languages Studied: English, Chinese
Submission Number: 8582