EducationQ: Evaluating LLMs’ Teaching Capabilities through Multi-Agent Dialogue Framework

ACL ARR 2024 December Submission 2334 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: While Large Language Models (LLMs) demonstrate significant capabilities across domains, existing benchmarks focus primarily on knowledge and reasoning abilities, leaving a critical gap in evaluating their teaching capabilities, particularly in managing real-time instructional interactions and adapting pedagogical strategies to student needs. This paper introduces EducationQ, a novel multi-agent dialogue framework that systematically evaluates LLMs' teaching capabilities through dynamic informal formative assessment (IFA) scenarios. The framework employs a triadic interaction model comprising specialized teacher, student, and evaluator agents to capture the nuanced dynamics of educational exchanges. Using a curated dataset of 1,498 questions spanning multiple disciplines and difficulty levels, we evaluated 14 state-of-the-art LLMs. The findings challenge the conventional assumption that larger models or stronger general capabilities inherently lead to superior teaching performance. Notably, on GPQA Diamond, Teacher Llama 3.1 70B Instruct achieved significant student learning gains (12.63% improvement) through sophisticated questioning strategies, and Teacher Gemini 1.5 Pro 002 demonstrated robust performance (7.58% improvement) through adaptive feedback mechanisms, underscoring the importance of targeted teaching approaches. Quantitative metrics and qualitative dialogue analyses reveal that successful LLMs-as-teachers prioritize focused strategies and adaptive interactions aligned with established educational theories rather than broader knowledge repositories. The work contributes both a systematic framework for evaluating AI teaching capabilities and empirical insights for developing effective educational applications, bridging the gap between AI capabilities and educational needs.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: educational applications
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2334