- Abstract: An algorithm is introduced for learning a predictive state representation with off-policy temporal difference (TD) learning that is then used to learn to steer a vehicle with reinforcement learning. There are three components being learned simultaneously: (1) the off-policy predictions as a compact representation of state, (2) the behavior policy distribution for estimating the off-policy predictions, and (3) the deterministic policy gradient for learning to act. A behavior policy discriminator is learned and used for estimating the important sampling ratios needed to learn the predictive representation off-policy with general value functions (GVFs). A linear deterministic policy gradient method is used to train the agent with only the predictive representations while the predictions are being learned. All three components are combined, demonstrated and evaluated on the problem of steering the vehicle from images in the TORCS racing simulator environment. Steering from only images is a challenging problem where evaluation is completed on a held-out set of tracks that were never seen during training in order to measure the generalization of the predictions and controller. Experiments show the proposed method is able to steer smoothly and navigate many but not all of the tracks available in TORCS with performance that exceeds DDPG using only images as input and approaches the performance of an ideal non-vision based kinematics model.
- Keywords: Predictive representations, general value functions, reinforcement learning, off-policy learning, behavior estimation
- TL;DR: An algorithm to learn a predictive state representation with general value functions and off-policy learning is applied to the problem of vision-based steering in autonomous driving.