Keywords: deep reinforcement learning, instability, Hamiltonian policy gradient, stationary, quantum K-spin
Abstract: A foundational issue in deep reinforcement learning (DRL) is that \textit{Bellman's optimality equation has multiple fixed points}, so it fails to return a consistent one. Direct evidence of this is the instability of existing DRL algorithms, namely the high variance of cumulative rewards over multiple runs. To fix this problem, we propose a quantum K-spin Hamiltonian regularization term (H-term) that helps a policy network stably find a \textit{stationary} policy, which represents the lowest-energy configuration of a system. First, we draw a novel analogy between a Markov Decision Process (MDP) and a \textit{quantum K-spin Ising model} and reformulate the objective function as a quantum K-spin Hamiltonian equation, a functional of the policy that measures its energy. Then, we propose a generic actor-critic algorithm that uses the H-term to regularize the policy (actor) network and provide the corresponding Hamiltonian policy gradient calculations. Finally, on six challenging MuJoCo tasks over 20 runs, the proposed algorithm reduces the variance of cumulative rewards by $65.2\% \sim 85.6\%$ compared with existing algorithms.
TL;DR: Apply a quantum K-spin Hamiltonian equation as a regularizer and obtain a new actor-critic algorithm that finds a physically stationary policy.
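Since the abstract describes the H-term only at a high level, the following is a minimal sketch (not the paper's implementation) of how such a regularizer could be added to an actor-critic policy loss, assuming the H-term enters as an additive penalty with a hypothetical weight lam; the names actor_loss_with_h_term, h_term, and lam are illustrative placeholders, and the actual K-spin Hamiltonian computation is not shown.

    import torch

    def actor_loss_with_h_term(log_probs, advantages, h_term, lam=0.1):
        # Standard policy-gradient surrogate: maximize advantage-weighted log-probability.
        pg_loss = -(log_probs * advantages.detach()).mean()
        # Hypothetical H-term regularizer: penalize high-energy policies,
        # pushing the actor toward a lower-energy (stationary) configuration.
        return pg_loss + lam * h_term

    # Dummy usage: in practice log_probs come from the policy network,
    # advantages from the critic, and h_term from the K-spin Hamiltonian estimate.
    log_probs = torch.randn(128, requires_grad=True)
    advantages = torch.randn(128)
    h_term = torch.tensor(0.5, requires_grad=True)
    loss = actor_loss_with_h_term(log_probs, advantages, h_term)
    loss.backward()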
Supplementary Material: zip