Training Equilibria in Reinforcement LearningDownload PDF

08 Oct 2022 (modified: 05 May 2023)Deep RL Workshop 2022Readers: Everyone
Keywords: reinforcement learning, reinforcement learning theory, game theory
TL;DR: We analyze "training equilibria" of RL algorithms: the set of policies that an algorithm can converge to.
Abstract: In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria---policies that are stable under further training---and can converge to equilibria that are strictly suboptimal. Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances like a flexible policy parametrization. We show theoretically that the core problem is that in partially observed environments, an agent's past actions induce a distribution on hidden states. Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher reward equilibrium, \emph{even when there exists a memoryless optimal policy}. Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and parameter noise helps policies escape suboptimal equilibria.
0 Replies