Policy Advantage Networks
- Keywords: Reinforcement Learning, Offline Reinforcement Learning
- Abstract: An agent's goal is to find policies that maximize its expected value, and the value function is a core component of many reinforcement learning algorithms. In practice, the value function is rarely used by itself; instead, it is used to form estimates of quantities that are more suitable for learning. These quantities commonly take the form of value differentials, which allow the comparison of values between different states, actions, and policies. In this paper, we propose a family of algorithms that focus directly on these value differentials, designing a critic that predicts the value differential between two policies given as inputs. Policy improvement can then be performed by following the gradient of this differential critic with respect to the input policies. We further develop per-time-step formulations of our algorithm and show that it satisfies a differential Bellman equation under an augmented Markov decision process, allowing the application of temporal-difference methods. We evaluate our algorithm in the online, offline, and zero-shot learning settings, and show competitive performance on a range of control tasks.
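The core idea of a differential critic and gradient-based policy improvement can be sketched minimally. The linear parameterisation, the function names, and the dimensionality below are illustrative assumptions, not the paper's actual architecture or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # dimensionality of a policy-parameter vector (illustrative)

# Hypothetical linear differential critic: D(theta1, theta2) = w . (theta1 - theta2),
# an estimate of V(pi_theta1) - V(pi_theta2). The subtraction makes the critic
# antisymmetric by construction: D(a, b) == -D(b, a).
w = rng.normal(size=d)

def differential_critic(theta1, theta2):
    """Predicted value differential between policies theta1 and theta2."""
    return w @ (theta1 - theta2)

def improve(theta, lr=0.1):
    """One policy-improvement step: ascend the gradient of the differential
    with respect to the first (candidate) policy, holding the reference fixed.
    For this linear critic, that gradient is simply w."""
    grad = w  # d D(theta1, theta) / d theta1, evaluated at theta1 = theta
    return theta + lr * grad

theta = rng.normal(size=d)
theta_new = improve(theta)

# The critic predicts the improved policy outperforms the reference:
# D(theta_new, theta) = lr * ||w||^2 > 0.
print(differential_critic(theta_new, theta))
```

In practice the critic would be a neural network trained on observed returns, and the gradient would come from automatic differentiation rather than a closed form; this sketch only shows the shape of the improvement rule.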