Keywords: reinforcement learning, reinforcement learning theory, game theory
TL;DR: We analyze "training equilibria" of RL algorithms: the set of policies that an algorithm can converge to.
Abstract: In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria---policies that are stable under further training---and can converge to equilibria that are strictly suboptimal.
Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances like a flexible policy parametrization.
We show theoretically that the core problem is that, in partially observable environments, an agent's past actions induce a distribution over the hidden states.
Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher-reward equilibrium, \emph{even when there exists a memoryless optimal policy}.
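As a sketch of this mechanism (using standard POMDP notation that is my own choice, not the paper's: transition kernel $T$, observation kernel $O$, belief $b_t$), the hidden-state distribution conditioned on the observable history satisfies
\[
b_t(h) \;=\; \Pr\bigl(h_t = h \mid o_{1:t}, a_{1:t-1}\bigr) \;\propto\; O(o_t \mid h) \sum_{h'} T(h \mid h', a_{t-1})\, b_{t-1}(h'),
\]
so the state distribution a memoryless policy faces at a given observation depends on the policy's own past actions. One way to read the game-theoretic framing above is that the policy is evaluated against a state distribution it shapes itself, so a fixed point of training is an equilibrium of this interaction rather than a guaranteed optimum.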
Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and that parameter noise helps policies escape suboptimal equilibria.
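For a concrete illustration of multiple training equilibria, consider a toy two-step POMDP (an assumption for illustration, not the paper's environment or code) in which both steps emit the same observation and the first action silently becomes the hidden state: repeating action A yields reward 2, repeating action B yields reward 1, and mixing yields 0. A memoryless policy is a single probability $p$ of playing A, the expected return is $J(p) = 2p^2 + (1-p)^2$, and plain REINFORCE converges to $p \approx 1$ (optimal) or $p \approx 0$ (strictly suboptimal) depending only on initialization, even though every trajectory is explored and a memoryless optimal policy (always A) exists. A minimal sketch, with hyperparameters chosen for the toy:

```python
# Illustrative sketch only: the environment, constants, and hyperparameters
# are assumptions for this toy, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

def train(theta, updates=2000, batch=256, lr=0.05):
    """REINFORCE on a memoryless policy for the two-step aliased POMDP.

    Both steps share one observation, so the policy is a single Bernoulli
    parameter p = sigmoid(theta) = Pr[play A]. The first action silently
    becomes the hidden state; reward is 2 for (A, A), 1 for (B, B), else 0.
    """
    for _ in range(updates):
        p = 1.0 / (1.0 + np.exp(-theta))
        a1 = rng.random(batch) < p      # step-1 actions; also the hidden state
        a2 = rng.random(batch) < p      # step-2 actions from the same policy
        r = np.where(a1 & a2, 2.0, np.where(~a1 & ~a2, 1.0, 0.0))
        # Score function of Bernoulli(sigmoid(theta)) at action a is (a - p);
        # the REINFORCE estimate is the return times the summed per-step scores.
        grad = np.mean(r * ((a1 - p) + (a2 - p)))
        theta += lr * grad
    return 1.0 / (1.0 + np.exp(-theta))

# Expected return J(p) = 2 p^2 + (1 - p)^2 has stable maxima at p = 1 (return 2)
# and p = 0 (return 1), separated by the interior critical point p = 1/3.
print(train(theta=+1.0))   # starts with p > 1/3: converges near p = 1 (optimal)
print(train(theta=-1.0))   # starts with p < 1/3: converges near p = 0 (suboptimal)
```

With the batch averaging above, the gradient noise is small relative to the basin of attraction, so the two initializations reliably settle into the two different equilibria despite every trajectory having positive probability throughout training.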