Keywords: Reinforcement Learning, Off-policy, Representation Learning
Abstract: Off-policy Reinforcement Learning (RL) is fundamental to realizing intelligent decision-making agents by trial and error.
The most notorious issue of off-policy RL is known as the Deadly Triad, i.e., the combination of Bootstrapping, Function Approximation, and Off-policy Learning.
Despite recent advances in bootstrapping algorithms with better bias control, improvements on the latter two factors are relatively less studied. In this paper, we propose a general off-policy RL algorithm based on policy representation and a policy-extended value function approximator (PeVFA). Orthogonal to better bootstrapping, our improvement is two-fold. On one hand, PeVFA's ability to fit the value functions of multiple policies, conditioned on their low-dimensional policy representations, offers preferable function approximation with less interference and better generalization. On the other hand, PeVFA and policy representation allow us to perform off-policy learning in a more general and sufficient manner. Specifically, we perform additional value learning for proximal historical policies along the learning process.
This drives value generalization from previously learned policies and, in turn, leads to more efficient learning. We evaluate our algorithms on continuous control tasks, and the empirical results demonstrate consistent improvements in efficiency and stability.
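To make the core idea of a policy-extended value function concrete, below is a minimal sketch of a value network that additionally conditions on a policy representation, so a single approximator can evaluate multiple policies. The class name `PeVFA`, the dimensions, and the plain MLP architecture are illustrative assumptions; the paper's actual network and policy-representation encoder may differ.

```python
import torch
import torch.nn as nn


class PeVFA(nn.Module):
    """Sketch of a policy-extended value function approximator.

    Besides the state and action, the network takes a low-dimensional
    representation of the policy being evaluated, so one approximator
    can fit and generalize over the values of multiple policies.
    """

    def __init__(self, state_dim, action_dim, policy_repr_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + policy_repr_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action, policy_repr):
        # Q(s, a | chi_pi): the value estimate is conditioned on the policy representation.
        return self.net(torch.cat([state, action, policy_repr], dim=-1))


if __name__ == "__main__":
    # Hypothetical dimensions for illustration only.
    q = PeVFA(state_dim=17, action_dim=6, policy_repr_dim=64)
    s = torch.randn(32, 17)
    a = torch.randn(32, 6)
    chi = torch.randn(32, 64)   # policy representation, e.g., an encoding of policy parameters
    print(q(s, a, chi).shape)   # torch.Size([32, 1])
```

With such a network, the additional value learning described above would amount to regressing targets for representations of proximal historical policies as well as the current one, encouraging generalization of value estimates to the newest policy.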
Submission Number: 59