Adaptive Two-Level Quasi-Monte Carlo for Soft Actor-Critic

Published: 19 Jun 2024 · Last Modified: 26 Jul 2024 · ARLET 2024 Poster · CC BY 4.0
Keywords: Reinforcement Learning, Quasi-Monte Carlo
TL;DR: We propose an adaptive two-level quasi-Monte Carlo method to improve policy gradient estimation in Soft Actor-Critic.
Abstract: In the Actor-Critic framework, the policy gradient is often expressed as an integral $\mathbb{E}\left[ h(X)\right]$. To estimate this integral with better convergence guarantees, the quasi-Monte Carlo (QMC) method can be used with a maximum sample size of $2^M$, and the resulting estimator $\widehat I_{2^M}^{\mathrm{QMC}}$ achieves an error rate of $O(2^{-M+\varepsilon})$ for arbitrarily small $\varepsilon > 0$. However, such a large number of QMC points often incurs a substantial computational cost. To address this issue, we propose an adaptive two-level quasi-Monte Carlo (ATQ) method for approximating $\mathbb{E}\left[ h(X)\right]$ with far fewer samples than $\widehat I_{2^M}^{\mathrm{QMC}}$. The ATQ method comprises two levels: a base level and a stochastic level. The base level employs large sample sizes to increase accuracy during the unstable phase of learning, and shifts to small sample sizes to save cost once training stabilizes. At the stochastic level, we randomize the number of samples so that the ATQ method is an unbiased estimator of $\widehat I_{2^M}^{\mathrm{QMC}}$. Theoretically, for a base-level sample size of $2^b$, the ATQ method converges to $\mathbb{E}\left[ h(X)\right]$ at the rate $O(2^{-b+\varepsilon})$ for arbitrarily small $\varepsilon > 0$, which improves on the Monte Carlo (MC) rate $O(2^{-b/2})$. Experimentally, we compare the ATQ-based Soft Actor-Critic method against strong baselines on online MuJoCo environments and offline D4RL suboptimal datasets. Our approach achieves state-of-the-art performance, outperforming other on-policy and off-policy methods in most of these online environments and offline datasets.
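To make the estimator concrete, here is a minimal Python sketch of a two-level QMC estimator in the spirit described above: a scrambled-Sobol base estimate with $2^b$ points plus a single randomized, reweighted difference term whose expectation telescopes up to the $2^M$-point estimate. The function names (`qmc_estimate`, `atq_estimate`), the scalar standard-normal integrand, and the geometric level probabilities are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch: unbiased two-level QMC estimation of E[h(X)], X ~ N(0, 1).
# Assumptions (not from the paper): geometric level probabilities and a
# single-term stochastic correction, as in randomized multilevel estimators.
import numpy as np
from scipy.stats import qmc, norm

def qmc_estimate(h, m, seed=0):
    """Scrambled-Sobol estimate of E[h(X)] with 2^m points."""
    sampler = qmc.Sobol(d=1, scramble=True, seed=seed)
    u = sampler.random_base2(m=m)      # 2^m points in (0, 1)
    x = norm.ppf(u[:, 0])              # map to N(0, 1) via inverse CDF
    return h(x).mean()

def atq_estimate(h, b=6, M=12, rng=None):
    """Unbiased estimator of the 2^M-point QMC estimate at ~2^b cost.

    Base level: a 2^b-point QMC estimate. Stochastic level: one randomly
    chosen difference term, reweighted by its sampling probability so the
    expectation telescopes to the 2^M-point estimate.
    """
    rng = np.random.default_rng(rng)
    base = qmc_estimate(h, b, seed=int(rng.integers(2**31)))
    levels = np.arange(b, M)           # candidate correction levels
    probs = 2.0 ** -(levels - b + 1)   # geometric level probabilities
    probs /= probs.sum()
    l = rng.choice(levels, p=probs)
    seed = int(rng.integers(2**31))
    # Same seed couples the two point sets: the 2^l points are a prefix
    # of the 2^(l+1) points, keeping the difference term small.
    diff = qmc_estimate(h, l + 1, seed=seed) - qmc_estimate(h, l, seed=seed)
    return base + diff / probs[l - b]

# Example: E[X^2] = 1 for X ~ N(0, 1).
print(atq_estimate(lambda x: x**2, b=6, M=12, rng=0))
```

Averaging many independent `atq_estimate` calls illustrates the unbiasedness claim: the expected cost stays near the base level $2^b$ because deeper correction levels are drawn with geometrically decaying probability.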
Submission Number: 43