Keywords: multi-turn RL for education, LLM-assisted tutoring
Abstract: Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize the immediate response at each turn. However, this can fail in multi-turn dialogue settings, such as online math tutoring, where a single-turn-optimal tutor may give away answers instead of guiding the student step by step. We introduce a method that enhances LLM-based tutors by representing the dialogue history with a lower-dimensional (student) state representation and optimizing a long-term policy to select high-level actions given that state. This better aligns the tutor with the long-term objective of helping the student solve the target math problem(s) independently. Our approach, based on lower-dimensional states and high-level actions, is more computationally efficient than training the tutor policy end-to-end to directly generate the tutor's response. In LLM-simulated tutoring scenarios evaluated on GSM8K, our approach improves students' long-term outcomes by 50% compared to prompting baselines.
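A minimal sketch of the idea described in the abstract, not the paper's implementation: the dialogue history is compressed into a small student state, a policy selects a high-level tutoring action from that state, and the LLM tutor only needs to verbalize the chosen action. All class names, state fields, action labels, and heuristics below are illustrative assumptions.

```python
import random
from dataclasses import dataclass

# Hypothetical high-level tutoring actions the long-term policy chooses among.
ACTIONS = ["ask_guiding_question", "give_hint", "confirm_step", "correct_misconception"]

@dataclass
class StudentState:
    """Low-dimensional summary of the dialogue history (illustrative fields)."""
    steps_completed: int    # solution steps the student has finished so far
    recent_errors: int      # errors observed in the last few turns
    asked_for_answer: bool  # whether the student requested the final answer

def encode_state(history: list[str]) -> StudentState:
    """Placeholder encoder: a real system would infer this from the transcript."""
    return StudentState(
        steps_completed=sum("correct" in turn for turn in history),
        recent_errors=sum("error" in turn for turn in history[-3:]),
        asked_for_answer=any("what is the answer" in turn.lower() for turn in history),
    )

def policy(state: StudentState) -> str:
    """Stand-in for a learned long-term policy over high-level actions."""
    if state.recent_errors > 0:
        return "correct_misconception"
    if state.asked_for_answer:
        return "ask_guiding_question"  # guide toward the solution instead of revealing it
    return random.choice(["give_hint", "confirm_step"])

def verbalize(action: str) -> str:
    """The LLM tutor would turn the chosen action into an utterance; templates here."""
    templates = {
        "ask_guiding_question": "What quantity does the problem ask you to find first?",
        "give_hint": "Try writing the total as a sum of the two parts you identified.",
        "confirm_step": "Yes, that step is correct. What comes next?",
        "correct_misconception": "Careful: re-check how you set up that equation.",
    }
    return templates[action]

history = ["Student: what is the answer?"]
action = policy(encode_state(history))
print(action, "->", verbalize(action))
```

The compact state and small discrete action set are what make this cheaper than optimizing the full response-generation policy end-to-end: the RL problem is over a few features and a handful of actions rather than over token sequences.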
Supplementary Material: pdf
Submission Number: 51