Abstract: Current Reinforcement Learning algorithms have reached new heights in performance. However, such algorithms often require hundreds of millions of samples and frequently yield policies that cannot transfer between tasks without full retraining. Successor features aim to improve this situation by decomposing the policy into two components: one capturing environmental dynamics and the other modelling reward, where the reward function is formulated as a linear combination of learned state features and a learned parameter vector. Under this form, transfer between related tasks requires training only the reward component. In this paper, we propose a novel extension to the successor feature framework that yields a natural second-order variant. After deriving the new state-action value function, a second additive term emerges; this term predicts reward as a non-linear combination of state features while providing additional benefits. Experimentally, we show that this term explicitly models the environment's stochasticity and can also be used in place of $\epsilon$-greedy exploration during transfer. The performance of the proposed extension is validated empirically on a 2D navigation task, the control of a simulated robotic arm, and the Doom environment.
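For reference, a minimal sketch of the standard (first-order) successor-feature decomposition the abstract builds on, following Barreto et al. (2017): $\boldsymbol{\phi}$ denotes the learned state features, $\mathbf{w}$ the learned reward parameter vector, and $\psi^{\pi}$ the successor features under policy $\pi$. The paper's proposed second-order term is its contribution and is not reproduced here.

$$
r(s, a, s') = \boldsymbol{\phi}(s, a, s')^{\top}\mathbf{w}, \qquad
\psi^{\pi}(s, a) = \mathbb{E}^{\pi}\!\left[\sum_{i=t}^{\infty} \gamma^{\,i-t}\,\boldsymbol{\phi}_{i+1} \,\Big|\, S_t = s,\, A_t = a\right], \qquad
Q^{\pi}(s, a) = \psi^{\pi}(s, a)^{\top}\mathbf{w}.
$$

Under this decomposition, $\psi^{\pi}$ captures the environment dynamics under $\pi$, so transferring to a related task with a new reward only requires re-estimating $\mathbf{w}$.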