SocraticEval: A Benchmark for Evaluating the Socratic Questioning Ability of LLMs in Dialogue Interaction
Keywords: Socratic Questioning, Large Language Model (LLM) Evaluation, Benchmark Construction
Abstract: Socratic questioning is vital for fostering critical reasoning in domains such as education. However, there is currently no effective framework for assessing this capability in Large Language Models (LLMs). To bridge this gap, we propose \textsc{SocraticEval}, a benchmark that systematically decomposes the capability into \textit{Question Generation} and \textit{Strategy Utilization}. Leveraging our multi-domain dataset, \textsc{SocraticEnv}, we reveal a critical gap between theory and practice: state-of-the-art models exhibit limited strategic diversity, frequently devolving into mere rebuttal rather than constructive guidance, which undermines the Socratic method's intended value. They also show a pronounced deficiency in interrogating logical fallacies. To address this deficiency, we construct \textsc{SocraticPref}, a human preference dataset with ranked candidate questions, and apply Direct Preference Optimization (DPO), yielding consistent improvements in fallacy-focused questioning.
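For reference, the DPO objective used here is the standard one; a minimal statement follows, assuming each \textsc{SocraticPref} instance pairs a preferred question $q^{+}$ with a dispreferred question $q^{-}$ for a dialogue context $x$ (this notation follows the original DPO formulation rather than this submission):
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,q^{+},\,q^{-})}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(q^{+}\mid x)}{\pi_{\mathrm{ref}}(q^{+}\mid x)} - \beta\log\frac{\pi_\theta(q^{-}\mid x)}{\pi_{\mathrm{ref}}(q^{-}\mid x)}\right)\right]$$
where $\pi_\theta$ is the policy being fine-tuned, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the sigmoid, and $\beta$ is a temperature-like scaling hyperparameter.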
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: evaluation and metrics, task-oriented
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis, Position papers
Languages Studied: English, Chinese
Submission Number: 8582