Mean Field Langevin Actor-Critic: Faster Convergence and Global Optimality beyond Lazy Learning

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: policy gradient method, temporal-difference learning, actor-critic, global optimality, linear convergence, neural network, mean field, feature learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: We study how deep reinforcement learning algorithms learn meaningful features when optimized to find the optimal policy. In particular, we focus on a version of the neural actor-critic algorithm in which both the actor and the critic are represented by over-parameterized neural networks in the mean-field regime and are updated via temporal-difference (TD) learning and policy gradient, respectively. Specifically, for the critic neural network to perform policy evaluation, we propose the $\textit{mean-field Langevin TD learning}$ (MFLTD) method, an extension of mean-field Langevin dynamics with proximal TD updates, and compare its effectiveness against existing methods through numerical experiments. In addition, for the actor neural network to perform policy updates, we propose $\textit{mean-field Langevin policy gradient}$ (MFLPG), which implements policy gradient in the policy space via a version of Wasserstein gradient flow in the space of network parameters. We prove that MFLTD finds the true value function, and that the sequence of actors generated by MFLPG converges linearly to the globally optimal policy of the Kullback-Leibler divergence regularized objective. To the best of our knowledge, this is the first linear convergence guarantee for neural actor-critic algorithms with both $\textit{global optimality}$ and $\textit{feature learning}$.
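To make the critic-side idea concrete, the following is a minimal, hypothetical sketch of a mean-field Langevin TD step: the value function is the average output of a population of neuron particles, and each particle takes a noisy TD semi-gradient step (Langevin dynamics). All names and choices here (`M`, `eta`, `lam`, the `tanh` features, the step scaling) are illustrative assumptions, not the paper's exact MFLTD construction, which additionally uses proximal TD updates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 4, 512                      # state dimension, number of neuron particles
gamma, eta, lam = 0.9, 1e-2, 1e-3  # discount, step size, Langevin temperature
W = rng.normal(size=(M, d))        # particle parameters, one row per neuron

def activations(W, s):
    """Per-particle features; the critic is their mean-field average."""
    return np.tanh(W @ s)

def V(W, s):
    """Critic value: empirical mean over the particle population."""
    return activations(W, s).mean()

def mfl_td_step(W, s, r, s_next):
    """One noisy TD semi-gradient step on every particle (Langevin dynamics)."""
    delta = r + gamma * V(W, s_next) - V(W, s)   # TD error on this transition
    # per-particle gradient of the critic output at s (tanh' = 1 - tanh^2)
    grad = (1.0 - activations(W, s) ** 2)[:, None] * s[None, :]
    noise = rng.normal(size=W.shape)             # injected Gaussian noise
    # TD semi-gradient ascent plus Langevin noise (entropic regularization)
    return W + eta * delta * grad + np.sqrt(2.0 * eta * lam) * noise

# Illustrative usage on a synthetic transition (s, r, s_next).
s, s_next, r = rng.normal(size=d), rng.normal(size=d), 1.0
W = mfl_td_step(W, s, r, s_next)
```

The actor-side MFLPG update would be analogous in spirit: each actor particle follows a noisy policy-gradient direction, so that the particle population realizes a Wasserstein gradient flow over the distribution of network parameters, as described in the abstract.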
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6885