Towards Automated Evaluation of Socratic Tutoring: Introducing IndirectScore for Programming Education

ACL ARR 2025 May Submission 3007 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: The Socratic method is an effective pedagogy that uses open-ended questions to foster critical thinking and deeper understanding, but scaling it requires reliable evaluation of question quality, particularly in terms of indirectness. In this work, we propose IndirectScore, a preliminary automated metric that assesses the indirectness of a Socratic question by using language-model surprisal as a proxy for its subtlety. Our approach combines insights from linguistics, NLP, and education to evaluate whether a tutor's question guides students appropriately without being overly leading, while explicitly controlling for topical relevance to avoid confounding factors. In an initial evaluation on a newly constructed benchmark of 168 programming dialogues with expert-labeled question quality, IndirectScore achieves over 71% agreement with human judgments when distinguishing clearly indirect from direct questions, outperforming traditional NLP metrics such as ROUGE-L and BERTScore. This work is a further step towards scalable evaluation of Socratic questioning, with implications for assessing indirect communication in other interdisciplinary NLP applications. While these early results suggest potential for building robust AI tutoring systems, we highlight important limitations, including the small dataset, noisy signals, and domain generalisability, and outline directions for future work.
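To make the surprisal idea concrete, the following is a minimal sketch assuming a HuggingFace causal LM (GPT-2 here as a stand-in); the function name mean_surprisal is illustrative, and the paper's relevance control is omitted, so this is not the authors' implementation of IndirectScore:

```python
# Hypothetical sketch: mean token surprisal of a tutor question given the
# dialogue context, as one possible proxy for indirectness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_surprisal(context: str, question: str) -> float:
    """Mean negative log-probability (in nats) of question tokens given context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, q_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so shift and score targets.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the log-probabilities of the question tokens.
    q_lp = token_lp[:, ctx_ids.shape[1] - 1:]
    return float(-q_lp.mean())
```

Under this proxy, a higher mean surprisal of the question given the dialogue context would indicate a less predictable, and hence more indirect, question; the actual metric additionally controls for topical relevance.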
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: evaluation and metrics, educational applications, interdisciplinary, multidisciplinary
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3007