Keywords: reinforcement learning, policy optimization, linear MDP, general function approximation
Abstract: While policy optimization algorithms have demonstrated remarkable empirical success in reinforcement learning (RL) tasks, their theoretical analysis is limited compared to that of value-based algorithms. In this paper, we address this gap by proposing a new provably efficient policy optimization algorithm that incorporates optimistic value estimation and rare policy switches. For linear Markov decision processes (MDPs), our algorithm achieves a regret bound of $\tilde{O}(d^2 H^2 \sqrt{T})$, the sharpest regret bound known for a policy optimization algorithm in linear MDPs. Furthermore, we extend our algorithm to general function approximation and establish a regret bound of $\tilde{O}(\sqrt{T})$. To the best of our knowledge, this is the first regret guarantee for a policy optimization algorithm with general function approximation. Numerical experiments demonstrate that our algorithm achieves regret performance competitive with existing RL algorithms while remaining computationally efficient, supporting our theoretical claims.
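The "rare policy switches" idea mentioned in the abstract can be illustrated with a minimal sketch. The snippet below is not the paper's algorithm; it shows a commonly used switching criterion for linear function approximation, where the policy is re-solved only when the determinant of the regularized feature covariance matrix doubles, which limits the number of switches to $O(d \log T)$. All names and the random feature stream are hypothetical placeholders.

```python
import numpy as np

def demo_rare_policy_switches(T=2000, d=4, seed=0):
    """Illustrative sketch of a determinant-doubling switching rule.

    The policy is updated only when det(cov) has doubled since the last
    switch, so the number of switches grows logarithmically in T rather
    than linearly.
    """
    rng = np.random.default_rng(seed)
    lam = 1.0
    cov = lam * np.eye(d)                 # regularized feature covariance
    last_switch_det = np.linalg.det(cov)
    switches = 0
    for _ in range(T):
        phi = rng.normal(size=d)          # placeholder feature vector
        phi /= np.linalg.norm(phi)
        cov += np.outer(phi, phi)         # rank-one covariance update
        if np.linalg.det(cov) > 2 * last_switch_det:
            switches += 1                 # re-solve the policy here
            last_switch_det = np.linalg.det(cov)
    return switches
```

With $T = 2000$ steps the rule fires only a few dozen times, which is what makes such algorithms computationally efficient: the expensive policy update is amortized over many environment steps.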
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7478