EduPersona: Benchmarking Subjective Ability Boundaries of Virtual Student Agents

18 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: AI4Education; Virtual Student Agents; Subjective Ability Evaluation; Benchmark and Dataset
Abstract: As large language models (LLMs) are increasingly integrated into education, virtual student agents are becoming essential for classroom simulation and teacher training. However, their classroom-oriented subjective abilities remain largely unassessed, which limits our understanding of model boundaries and hinders trustworthy deployment. We introduce EduPersona, a large-scale benchmark spanning two languages, three subjects, and ten persona types grounded in Big Five personality theory. The dataset contains 1,308 authentic classroom dialogue rounds (12,814 teacher–student Q&A turns) and is expanded via persona stylization to roughly 10× scale (128k turns), forming a comprehensive foundation for evaluation. Building on this resource, we decompose subjective performance into three progressive tasks: (Task 1) basic coherence, i.e., alignment among behavior, emotion, expression, voice, and classroom context; (Task 2) student realism; and (Task 3) long-term persona consistency. Together, these tasks establish an evaluation framework grounded in educational theory and empirical validity. We conduct systematic experiments on three representative LLM families, comparing their base versions with ten variants fine-tuned on EduPersona. Results show consistent and substantial improvements: +33.6% on Task 1, +30.6% on Task 2, and +14.9% on Task 3, highlighting both the benchmark's effectiveness and the heterogeneous difficulty of persona modeling. A human–AI alignment experiment further confirms that GPT-4o's judgments closely match expert consensus. In summary, EduPersona provides the first classroom benchmark centered on subjective student abilities, establishes a decoupled and verifiable paradigm for evaluating virtual learners, and will be open-sourced to support the development of trustworthy and human-like AI for education.
Primary Area: datasets and benchmarks
Submission Number: 10933