PEBBLE: A Pedagogical and SRL-Aware Benchmark for Evaluating LLM Tutors

Published: 24 Sept 2025, Last Modified: 24 Sept 2025, NeurIPS 2025 LLM Evaluation Workshop Poster, CC BY 4.0
Keywords: LLM tutors, multi-turn tutoring, scaffolding, diagnostic questioning, misconception repair, metacognitive support, affective support, self-regulated learning (SRL), rubric, overhelping penalty, LLM-as-judge, contamination controls, templated item generation, paraphrase-shift splits, cross-domain STEM benchmark, lifecycle evaluation
Abstract: Large language models are increasingly used as tutors, yet most evaluations measure what models know rather than how they teach. We present PEBBLE, an initial, compact, plug-and-play benchmark for multi-turn tutoring that scores five process-level dimensions grounded in the learning sciences: scaffolding, diagnostic questioning, misconception repair, metacognitive support, and affective support. PEBBLE formalizes a weighted per-turn scoring functional with an explicit overhelping penalty, scored by an LLM-as-judge, and incorporates contamination controls via templated item generation and paraphrase-shift splits. We evaluate eight contemporary models across four STEM domains (30 seeds per domain; 240 simulated episodes per model) using simulated students in short, text-only dialogues; findings should be interpreted under these conditions. PEBBLE consistently surfaces deficits in diagnostic questioning and misconception repair despite near-ceiling affective and metacognitive scores, and supports lifecycle analyses (scaling, post-training). Our contributions are: (i) a formal, SRL-aware rubric and scoring functional for multi-turn tutoring; (ii) a contamination-aware evaluation protocol with an LLM-as-judge; (iii) a cross-domain benchmark and open evaluation kit for reproducible lifecycle studies; and (iv) an empirical characterization of dimension-wise headroom that identifies diagnosis and repair as the primary levers for improving tutoring quality. Code, seeds, personas, judge prompts, and a leaderboard specification will be released upon acceptance.
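To make the scoring functional concrete, the following is a minimal sketch of a weighted per-turn score with an explicit overhelping penalty, averaged over an episode. The dimension names follow the rubric above, but the weights, the penalty coefficient, and all function and variable names are illustrative assumptions, not the paper's released evaluation kit.

```python
# Sketch of a PEBBLE-style episode score: a weighted sum of per-turn,
# per-dimension judge ratings minus an explicit overhelping penalty.
# Weights, the penalty coefficient, and all names are illustrative assumptions.

from typing import Dict, List

DIMENSIONS = [
    "scaffolding",
    "diagnostic_questioning",
    "misconception_repair",
    "metacognitive_support",
    "affective_support",
]

# Assumed: equal weights summing to 1 and a fixed overhelp penalty coefficient.
WEIGHTS: Dict[str, float] = {d: 1.0 / len(DIMENSIONS) for d in DIMENSIONS}
LAMBDA_OVERHELP = 0.5


def turn_score(judge_ratings: Dict[str, float], overhelp: float) -> float:
    """Weighted per-turn score minus the overhelping penalty.

    judge_ratings: LLM-as-judge ratings in [0, 1] for each rubric dimension.
    overhelp: judge-estimated degree of overhelping in [0, 1].
    """
    weighted = sum(WEIGHTS[d] * judge_ratings[d] for d in DIMENSIONS)
    return weighted - LAMBDA_OVERHELP * overhelp


def episode_score(turns: List[Dict[str, float]]) -> float:
    """Average the per-turn scores over a simulated tutoring episode."""
    return sum(
        turn_score({d: t[d] for d in DIMENSIONS}, t["overhelp"]) for t in turns
    ) / len(turns)


if __name__ == "__main__":
    # Toy two-turn episode: near-ceiling affect/metacognition but weak
    # diagnostic questioning and misconception repair.
    episode = [
        {"scaffolding": 0.7, "diagnostic_questioning": 0.3,
         "misconception_repair": 0.4, "metacognitive_support": 0.9,
         "affective_support": 0.95, "overhelp": 0.2},
        {"scaffolding": 0.6, "diagnostic_questioning": 0.2,
         "misconception_repair": 0.3, "metacognitive_support": 0.9,
         "affective_support": 0.9, "overhelp": 0.4},
    ]
    print(f"episode score: {episode_score(episode):.3f}")
```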
Submission Number: 113