Policy Search by Dynamic Programming

J. Andrew Bagnell, Sham Kakade, Andrew Y. Ng, Jeff G. Schneider

2003 (modified: 11 Nov 2022), NIPS 2003

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
