TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

11 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: benchmark, auto-eval, LLM-eval, evaluations, tutoring, dataset
TL;DR: A dataset and benchmark to evaluate tutoring capabilities of LLMs
Abstract: As students increasingly adopt large language models (LLMs) as learning aids, it is crucial to build models that are adept at handling the nuances of tutoring: they need to identify the core needs of students, be adaptive, provide personalized guidance, and be accurate. To this end, we introduce TUTORBENCH, a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of LLMs. The dataset comprises 1,490 samples curated by human experts, focused on high-school and AP-level curricula. The samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student's confusion, (ii) providing actionable feedback on a student's work, and (iii) promoting active learning through effective hint generation. To account for the inherent complexity of tutoring, samples are accompanied by sample-specific rubrics, which are used to judge model responses during evaluation. TUTORBENCH uses a reliable and fine-grained automatic evaluation method based on an LLM-judge and the sample-specific rubrics. We evaluate 16 frontier LLMs on TUTORBENCH and present a detailed analysis of their performance and behavior. Our results show that none of the frontier LLMs achieves a score greater than 56%, leaving large room for improvement. We find that LLMs fall short in exhibiting the full range of tutoring skills needed to guide, diagnose, and support students effectively, with all frontier models achieving less than a 60% pass rate on rubric criteria related to these skills. We also find that different model families exhibit varied strengths and limitations: the Claude models outperform others in supporting active learning, while they lag behind in the other two use cases. By releasing TUTORBENCH, we provide a comprehensive and unsaturated benchmark to guide the development of the next generation of AI tutors.
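For readers who want a concrete picture of how rubric-based LLM-judge scoring of this kind can work, a minimal sketch follows. The function names, prompt wording, and aggregation scheme here are illustrative assumptions, not TutorBench's released evaluation code.

```python
# Illustrative sketch of per-sample, rubric-based LLM-judge scoring.
# NOTE: names, prompt format, and weighting are assumptions for exposition only.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str    # one sample-specific rubric item, e.g. "identifies the sign error"
    weight: float = 1.0

def judge_criterion(judge_llm, student_context: str, tutor_response: str,
                    criterion: Criterion) -> bool:
    """Ask a judge model whether the tutor response satisfies one rubric criterion."""
    prompt = (
        "You are grading a tutoring response.\n"
        f"Student context:\n{student_context}\n\n"
        f"Tutor response:\n{tutor_response}\n\n"
        f"Criterion: {criterion.description}\n"
        "Answer strictly with PASS or FAIL."
    )
    verdict = judge_llm(prompt)  # judge_llm is any callable that returns the model's text
    return verdict.strip().upper().startswith("PASS")

def score_sample(judge_llm, student_context: str, tutor_response: str,
                 rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria the response passes (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    passed = sum(c.weight for c in rubric
                 if judge_criterion(judge_llm, student_context, tutor_response, c))
    return passed / total if total else 0.0
```

Averaging such per-sample pass rates across the dataset would yield an overall benchmark score comparable to the percentage figures reported in the abstract.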
Primary Area: datasets and benchmarks
Submission Number: 3842