Keywords: Reinforcement Learning, Policy Gradient, Non-convex Optimization
TL;DR: We present practical (stochastic) softmax policy gradient methods that, unlike prior work, do not require oracle-like knowledge to set algorithmic parameters, while maintaining the same theoretical guarantees.
Abstract: We consider (stochastic) softmax policy gradient (PG) methods for finite Markov Decision Processes (MDPs). While the PG objective is non-concave, recent research has used smoothness and gradient dominance to achieve convergence to an optimal policy. However, these results depend on extensive knowledge of the environment, such as the optimal action or the true mean reward vector, to configure the algorithm parameters, making the resulting algorithms impractical in real applications. To alleviate this problem, we propose PG methods that employ an Armijo line-search in the deterministic setting and an exponentially decreasing step-size in the stochastic setting. We demonstrate that the proposed algorithms offer theoretical guarantees similar to those of previous works, but without requiring knowledge of oracle-like quantities. Furthermore, we apply similar techniques to develop practical, theoretically sound entropy-regularized methods for both deterministic and stochastic settings. Finally, we empirically compare the proposed methods with previous approaches in single-state MDP environments.
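For intuition, below is a minimal sketch (not the authors' implementation) of deterministic softmax PG with an Armijo backtracking line-search in a single-state MDP (a bandit with known mean rewards). The parameters `eta_max`, `c`, and `beta` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def objective(theta, r):
    # Expected reward of the softmax policy in a single-state MDP (bandit).
    return softmax(theta) @ r

def gradient(theta, r):
    # Gradient of pi_theta^T r w.r.t. theta: pi_i * (r_i - pi^T r).
    pi = softmax(theta)
    return pi * (r - pi @ r)

def armijo_step(theta, r, eta_max=10.0, c=0.5, beta=0.5, max_backtracks=50):
    """One gradient-ascent step with backtracking Armijo line-search."""
    g = gradient(theta, r)
    f0 = objective(theta, r)
    eta = eta_max
    for _ in range(max_backtracks):
        # Accept the step once the Armijo (sufficient increase) condition holds.
        if objective(theta + eta * g, r) >= f0 + c * eta * (g @ g):
            break
        eta *= beta  # otherwise shrink the step-size and retry
    return theta + eta * g

# Usage example: 3-armed bandit with (hypothetical) mean rewards.
r = np.array([0.2, 0.5, 0.9])
theta = np.zeros(3)
for _ in range(200):
    theta = armijo_step(theta, r)
print(softmax(theta))  # probability mass concentrates on the best arm
```

The line-search removes the need to know problem-dependent constants (e.g., the smoothness parameter or the optimal action) when choosing the step-size, which is the practical gap the paper targets.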
Submission Number: 101