Keywords: education, policy evaluation, reinforcement learning
TL;DR: We introduce a suite of surrogate tasks that enable effective automatic evaluation of online tutoring policies.
Abstract: Recent advances in large language models (LLMs) have enabled the development of automated tutoring systems with the potential to deliver high-quality education at scale. These automated tutoring systems typically consist of language models that follow a pre-specified tutoring policy, which defines a strategy for responding to a user's utterance.
However, evaluating these systems remains challenging and difficult to scale.
Human raters and field studies are the gold standard for evaluation, but they are time-consuming and can be infeasible when several tutoring policies must be evaluated jointly.
Automated reference-based metrics that evaluate conversational coherence (e.g., BERTScore) are cheaper, but they are unreliable at assessing whether a tutoring policy meaningfully helps a student.
To address this gap, we introduce TutorTest, an evaluation framework consisting of surrogate tasks that simulate interactions between a tutor and a student with a cognitive error. TutorTest evaluates the ability of a tutoring policy to help a simulated student overcome their cognitive errors, assigning higher value to more effective tutoring policies. Moreover, TutorTest requires substantially smaller datasets than those needed to finetune LLMs, making it feasible to operate with only a few samples from a limited tutoring dataset.
In math and language learning experiments, TutorTest identifies the better tutoring strategy 97\% more often than baselines and evaluates policies at under 1\% of the time cost of human studies.
By enabling rapid, domain-grounded evaluations, our work provides a practical pathway for the development of tutoring systems.
This work underscores the potential of large language models to generate meaningful surrogate evaluation tasks that are aligned with real-world outcomes.
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 37