Teach2Eval: An Interaction-Driven LLMs Evaluation Method via Teaching Effectiveness

Published: 26 Jan 2026, Last Modified: 11 Apr 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: New Evaluation Method, Multi-dimensional Evaluation, Large Language Models, Data Contamination, Teach2Eval
Abstract: Recent progress in large language models (LLMs) has outpaced the development of effective evaluation methods. Evaluating LLMs with static, task-specific benchmarks is increasingly fragile due to contamination and saturation, and it fails to capture interactive reasoning. We introduce Teach2Eval, which reframes evaluation as teaching: a candidate model guides weaker student models, and the students' resulting gains constitute the score. This interactive setup is robust to contamination and exposes orthogonal abilities through fine-grained metrics across Application, Judgment, Guidance, and Reflection. The framework scales automatically by exploiting the natural error distributions of weak students, requiring neither bespoke rubrics nor human graders. Across 33 LLMs and 60 datasets, Teach2Eval achieves a Spearman correlation above 0.97 with human-preference leaderboards (e.g., Chatbot Arena, LiveBench), surpassing direct-evaluation baselines, while offering actionable training signals (capability hierarchies, early overfitting detection) at low cost. We open-source our code and data at https://github.com/zhiqix/Teach2Eval.
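The core scoring idea described in the abstract can be sketched as follows. This is an illustrative toy example, not the authors' implementation: all function names, accuracy values, and the simple "mean accuracy gain" aggregation are assumptions, standing in for the paper's actual protocol of measuring weak students' improvement under a candidate teacher's guidance.

```python
# Illustrative sketch (not the authors' implementation): a Teach2Eval-style
# score, where a candidate "teacher" model is rated by how much weak
# "student" models improve after receiving its guidance.

def teaching_gain(pre_acc, post_acc):
    """Mean accuracy improvement of students after the teacher's guidance."""
    assert len(pre_acc) == len(post_acc)
    return sum(post - pre for pre, post in zip(pre_acc, post_acc)) / len(pre_acc)

# Hypothetical student accuracies before and after guidance from two teachers.
pre = [0.40, 0.35, 0.50]
teacher_a_post = [0.55, 0.50, 0.62]  # stronger teacher: larger student gains
teacher_b_post = [0.42, 0.36, 0.51]  # weaker teacher: marginal student gains

score_a = teaching_gain(pre, teacher_a_post)
score_b = teaching_gain(pre, teacher_b_post)
print(round(score_a, 3), round(score_b, 3))  # teacher A scores higher
```

In the paper's actual framework, the score is further decomposed along the four dimensions named in the abstract (Application, Judgment, Guidance, Reflection) rather than collapsed into a single mean gain.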
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8691