Reinforcement Teaching

Calarina Muslimani; Alex Lewandowski; Dale Schuurmans; Matthew E. Taylor; Jun Luo

Published: 10 Jun 2023, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: Machine learning algorithms learn to solve a task, but are unable to improve their ability to learn. Meta-learning methods learn about machine learning algorithms and improve them so that they learn more quickly. However, existing meta-learning methods are either hand-crafted to improve one specific component of an algorithm or only work with differentiable algorithms. We develop a unifying meta-learning framework, called Reinforcement Teaching, to improve the learning process of any algorithm. Under Reinforcement Teaching, a teaching policy is learned, through reinforcement, to improve a student's learning algorithm. To learn an effective teaching policy, we introduce the parametric-behavior embedder that learns a representation of the student's learnable parameters from its input/output behavior. We further use learning progress to shape the teacher's reward, allowing it to more quickly maximize the student's performance. To demonstrate the generality of Reinforcement Teaching, we conduct experiments in which a teacher learns to significantly improve both reinforcement and supervised learning algorithms. Reinforcement Teaching outperforms previous work using heuristic reward functions and state representations, as well as other parameter representations.
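
As a rough illustration of the learning-progress idea mentioned in the abstract, the sketch below shapes a teacher's reward as the change in student performance between consecutive teaching steps. The function name `learning_progress_rewards`, the `gamma` parameter, and the exact form of the reward are assumptions made for illustration. The abstract does not give a formula, so this should not be read as the paper's definition.

```python
from typing import List, Sequence


def learning_progress_rewards(
    performance: Sequence[float], gamma: float = 1.0
) -> List[float]:
    """Hypothetical shaped teacher reward: r_t = gamma * p_{t+1} - p_t.

    performance[t] is the student's measured performance (for example,
    validation accuracy or average return) after the teacher's t-th action.
    The formula is an assumed, difference-based form of learning progress.
    """
    if len(performance) < 2:
        # No pair of consecutive measurements, so no progress to report.
        return []
    return [
        gamma * nxt - cur
        for cur, nxt in zip(performance, performance[1:])
    ]


if __name__ == "__main__":
    # Made-up student performance after each teacher action.
    perf = [0.125, 0.25, 0.5, 0.5, 0.75]
    rewards = learning_progress_rewards(perf)
    print(rewards)
    # With gamma = 1 the rewards telescope to final minus initial performance.
    print(sum(rewards), perf[-1] - perf[0])
```

Under this assumed form with `gamma = 1`, the summed reward equals the student's net improvement, so the teacher receives a signal at every step rather than only when the student reaches a final performance level.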

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: All text changes are highlighted in blue. We have revised Figures 4, 5, and 6, added Table 1 describing the teacher MDPs, and added Algorithm 1 describing the teacher training protocol. April 6: Added details on the parametric-behavior embedder's approximation ability in the linear setting (Appendix F) and a new supervised learning experiment in which the reward is just the accuracy (Appendix L.7); the new experiment has no threshold and no termination condition. June 9: Included additional requested references, made minor phrasing changes, and removed the highlighted differences.

Code: https://www.dropbox.com/sh/hjkzzgctnqf6d8w/AAAYEycaDvPOeifz8FZbR3kLa?dl=0

Assigned Action Editor: ~Marcello_Restelli1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 519
