Meta Learning the Step Size in Policy Gradient Methods

Published: 14 Jul 2021, Last Modified: 05 May 2023, AutoML@ICML2021 Poster, Readers: Everyone
Keywords: Meta Reinforcement Learning, Hyperparameter tuning, Policy Gradient
Abstract: Policy-based algorithms are among the most widely adopted techniques in model-free RL, thanks to their strong theoretical grounding and good properties in continuous action spaces. Unfortunately, these methods require precise, problem-specific hyperparameter tuning to achieve good performance and, as a consequence, they tend to struggle when asked to accomplish a series of heterogeneous tasks. In particular, the choice of step size has a crucial impact on the ability to learn a highly performing policy: it affects both the speed and the stability of training, and it is often the main culprit for poor results. In this paper, we tackle these issues with a Meta Reinforcement Learning approach, introducing a new formulation, known as meta-MDP, that can be used to solve any hyperparameter selection problem in RL with contextual processes. After providing a theoretical Lipschitz bound on the performance across different tasks, we adopt the proposed framework to train a batch RL algorithm that dynamically recommends the most adequate step size for different policies and tasks. Finally, we present an experimental campaign showing the advantages of selecting an adaptive learning rate in heterogeneous environments.
Ethics Statement: Hyperparameter selection for policy-based algorithms has a significant impact on the ability to learn a highly performing policy in Reinforcement Learning, especially with heterogeneous tasks, where different contexts may require different solutions. Our approach learns to automatically select the best configurations that could otherwise be identified through manual fine-tuning of the parameters. Consequently, our work can be seen as a small step in the AutoML direction, in which a practitioner could run the algorithm and, with some guidance, obtain optimal performance in a few steps without the need for manual fine-tuning. Beyond this, we are not aware of any societal consequences of our work, such as on welfare, fairness, or privacy.