Keywords: benchmark, auto-eval, LLM-eval, evaluations, tutoring, dataset
TL;DR: A dataset and benchmark to evaluate tutoring capabilities of LLMs
Abstract: As students increasingly adopt large language models (LLMs) as learning aids, it is crucial to build models that are adept at handling the nuances of tutoring: they need to identify the core needs of students, be adaptive, provide personalized guidance, and be accurate. To this end, we introduce TUTORBENCH, a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of LLMs. The dataset comprises 1,490 samples curated by human experts and focused on high-school and AP-level curricula. The samples are drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student's confusion, (ii) providing actionable feedback on a student's work, and (iii) promoting active learning through effective hint generation. To account for the inherent complexity of tutoring, each sample is accompanied by a sample-specific rubric that is used to judge model responses during evaluation. TUTORBENCH employs a reliable, fine-grained automatic evaluation method based on an LLM judge and these sample-specific rubrics. We evaluate 16 frontier LLMs on TUTORBENCH and present a detailed analysis of their performance and behavior. Our results show that no frontier LLM achieves a score above 56%, indicating substantial room for improvement. We find that LLMs fall short of exhibiting the full range of tutoring skills needed to guide, diagnose, and support students effectively, with all frontier models achieving pass rates below 60% on the rubric criteria related to these skills. We also find that different model families exhibit varied strengths and limitations: the Claude models outperform others in supporting active learning but lag behind in the other two use cases. By releasing TUTORBENCH, we provide a comprehensive and unsaturated benchmark to guide the development of the next generation of AI tutors.
Primary Area: datasets and benchmarks
Submission Number: 3842