Keywords: Policy gradient, adaptive step sizes
TL;DR: Larger-than-optimal step sizes cause the policy to overcommit to suboptimal actions.
Abstract: Policy gradient methods form the basis for many successful reinforcement learning algorithms, but their success depends heavily on selecting an appropriate step size and many other hyperparameters. While many adaptive step size methods exist, none are both free of hyperparameter tuning and able to converge quickly to an optimal policy. It is unclear why these methods fall short, so we aim to uncover what must be addressed to make an effective adaptive step size for policy gradient methods. Through an extensive empirical investigation, we find that when the step size is larger than optimal, the policy overcommits to suboptimal actions, leading to longer training times. These findings suggest the need for a new kind of policy optimization that can prevent or recover from entropy collapse.
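The following is a minimal sketch, not taken from the paper, of the phenomenon the abstract describes: vanilla REINFORCE with a softmax policy on a two-armed bandit, run with a small and a deliberately oversized step size. The reward values, step sizes, number of updates, and seeds are illustrative assumptions; the idea is that an oversized step size drives policy entropy toward zero after the first few sampled actions, and in the runs where the suboptimal arm was sampled first, the policy overcommits to it and takes longer to recover.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def run_bandit(step_size, steps, seed):
    """Vanilla REINFORCE on a two-armed bandit with a tabular softmax policy."""
    rng = np.random.default_rng(seed)
    rewards = np.array([1.0, 0.5])   # illustrative rewards: arm 0 optimal, arm 1 suboptimal
    theta = np.zeros(2)              # policy logits
    entropy_after_10 = None
    for t in range(steps):
        pi = softmax(theta)
        if t == 10:
            entropy_after_10 = entropy(pi)   # how collapsed the policy is early in training
        a = rng.choice(2, p=pi)
        # REINFORCE update: grad_theta log pi(a) = one_hot(a) - pi, scaled by the reward
        grad = -pi
        grad[a] += 1.0
        theta += step_size * rewards[a] * grad
    return entropy_after_10, softmax(theta)[0]

for lr in (0.05, 5.0):                       # small vs. oversized step size (assumed values)
    results = [run_bandit(lr, steps=200, seed=s) for s in range(200)]
    ent10 = np.mean([r[0] for r in results])
    p_opt = np.mean([r[1] for r in results])
    print(f"step size {lr:>4}: mean entropy after 10 updates = {ent10:.3f}, "
          f"mean P(optimal arm) after 200 updates = {p_opt:.3f}")
```

Averaging over seeds separates the two effects named in the abstract: the early-entropy measurement shows the collapse induced by the large step size, while the final probability of the optimal arm reflects the longer training time caused by overcommitting to the suboptimal one.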
Submission Number: 115