Automating pedagogical evaluation of LLM-based conversational agents

Published: 23 Jun 2025 · Last Modified: 23 Jun 2025 · Greeks in AI 2025 Poster · CC BY 4.0
Keywords: automated evaluation, Socratic dialogue, AI tutor
TL;DR: Automating AIED dialogue evaluation with an LLM-as-a-judge approach; we find it matches human judgment on surface features but struggles with subtle teaching behaviours.
Abstract: With the growing adoption of large language models (LLMs) in educational settings, there is an urgent need for systematic and scalable evaluation methods. Traditional natural language generation metrics such as BLEU, ROUGE and METEOR excel at measuring surface-level linguistic quality but fall short in evaluating the interactive, adaptive nature of conversational agents' dialogue, particularly its alignment with their intended design. To address these gaps, we propose an evaluation strategy that extends beyond technical evaluation (linguistic coherence and semantic relevance). In this pilot study we compare human and LLM-based evaluation of a conversational agent, with a focus on Socratic dialogue as a specific instantiation. Early results indicate that our LLM-as-a-Judge aligns closely with human evaluators on clear, surface-level qualities such as encouragement and actionable guidance, but less so on subtle pedagogical behaviours such as recognising errors and maintaining natural dialogue flow. These early results underscore the promise of LLM-based evaluators for scalable assessment of tutoring behaviours while highlighting the need for targeted fine-tuning and hybrid approaches to improve nuanced error detection and dialogue coherence.
Submission Number: 142
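
Below is a minimal sketch of what an LLM-as-a-Judge rubric scorer of this kind could look like. The dimension names follow the abstract (encouragement, actionable guidance, error recognition, dialogue flow), but the prompt wording, JSON response format, `call_judge_model` stub and Spearman-based agreement measure are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical sketch: score tutoring dialogues with an LLM judge and
# measure agreement with human ratings. Not the paper's implementation.
import json
from scipy.stats import spearmanr

DIMENSIONS = ["encouragement", "actionable_guidance", "error_recognition", "dialogue_flow"]

RUBRIC_PROMPT = """You are evaluating a Socratic tutoring dialogue.
Rate the tutor on each dimension from 1 (poor) to 5 (excellent) and
return only JSON, e.g. {{"encouragement": 4, ...}}.

Dimensions: {dims}

Dialogue:
{dialogue}
"""


def call_judge_model(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g. a chat-completion request)."""
    raise NotImplementedError("plug in your LLM client here")


def judge_dialogue(dialogue: str) -> dict:
    """Ask the judge model to score one dialogue on all rubric dimensions."""
    prompt = RUBRIC_PROMPT.format(dims=", ".join(DIMENSIONS), dialogue=dialogue)
    return json.loads(call_judge_model(prompt))


def agreement_with_humans(llm_scores: list[dict], human_scores: list[dict]) -> dict:
    """Per-dimension Spearman correlation between LLM and human ratings."""
    return {
        d: spearmanr(
            [s[d] for s in llm_scores], [s[d] for s in human_scores]
        ).correlation
        for d in DIMENSIONS
    }
```

In a setup like this, comparing per-dimension correlations would surface the pattern the abstract reports: higher agreement on surface-level dimensions (encouragement, actionable guidance) than on subtler ones (error recognition, dialogue flow).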