Keywords: multi-turn RL for education, LLM-assisted tutoring
Abstract: Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize the immediate response at each turn. However, this can fail in multi-turn dialogue settings, such as online math tutoring, where a single-turn-optimal tutor may give away answers instead of guiding the student step by step. We introduce a method that enhances LLM-based tutors by representing the dialogue history with a lower-dimensional (student) state representation and optimizing a long-term policy to select high-level actions given that state. This better aligns the tutor with the long-term objective of helping the student solve the target math problem(s) independently. Our approach, based on lower-dimensional states and high-level actions, is more computationally efficient than training the tutor policy end-to-end to directly generate the tutor's response. In LLM-simulated tutoring scenarios evaluated on GSM8K, our approach improves students' long-term outcomes by 50% compared to prompting baselines.
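A minimal sketch of the idea described in the abstract, not the paper's implementation: the dialogue history is compressed into a small student state, a policy selects a high-level tutoring action from that state, and the LLM tutor only needs to verbalize the chosen action. All class names, state fields, action labels, and heuristics below are illustrative assumptions.

```python
import random
from dataclasses import dataclass

# Hypothetical high-level tutoring actions the long-term policy chooses among.
ACTIONS = ["ask_guiding_question", "give_hint", "confirm_step", "correct_misconception"]

@dataclass
class StudentState:
    """Low-dimensional summary of the dialogue history (illustrative fields)."""
    steps_completed: int    # solution steps the student has finished so far
    recent_errors: int      # errors observed in the last few turns
    asked_for_answer: bool  # whether the student requested the final answer

def encode_state(history: list[str]) -> StudentState:
    """Placeholder encoder: a real system would infer this from the transcript."""
    return StudentState(
        steps_completed=sum("correct" in turn for turn in history),
        recent_errors=sum("error" in turn for turn in history[-3:]),
        asked_for_answer=any("what is the answer" in turn.lower() for turn in history),
    )

def policy(state: StudentState) -> str:
    """Stand-in for a learned long-term policy over high-level actions."""
    if state.recent_errors > 0:
        return "correct_misconception"
    if state.asked_for_answer:
        return "ask_guiding_question"  # guide toward the solution instead of revealing it
    return random.choice(["give_hint", "confirm_step"])

def verbalize(action: str) -> str:
    """The LLM tutor would turn the chosen action into an utterance; templates here."""
    templates = {
        "ask_guiding_question": "What quantity does the problem ask you to find first?",
        "give_hint": "Try writing the total as a sum of the two parts you identified.",
        "confirm_step": "Yes, that step is correct. What comes next?",
        "correct_misconception": "Careful: re-check how you set up that equation.",
    }
    return templates[action]

history = ["Student: what is the answer?"]
action = policy(encode_state(history))
print(action, "->", verbalize(action))
```

The compact state and small discrete action set are what make this cheaper than optimizing the full response-generation policy end-to-end: the RL problem is over a few features and a handful of actions rather than over token sequences.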
Supplementary Material: pdf
Submission Number: 51