Comparing Deterministic and Soft Policy Gradients for Optimizing Gaussian Mixture Actors

TMLR Paper 3180 Authors

14 Aug 2024 (modified: 12 Nov 2024) · Decision pending for TMLR · CC BY 4.0
Abstract: Gaussian Mixture Models (GMMs) have recently been proposed for approximating actors in actor-critic reinforcement learning algorithms. Such GMM-based actors are commonly optimized using stochastic policy gradients together with an entropy-maximization objective. In contrast to previous work, we define and study deterministic policy gradients for optimizing GMM-based actors. Like the stochastic gradient approaches, our proposed method, denoted $\textit{Gaussian Mixture Deterministic Policy Gradient}$ (Gamid-PG), encourages policy entropy maximization. To this end, we define the GMM entropy gradient using a $\textit{variational approximation}$ built from the KL-divergences between the GMM's constituent Gaussians. We compare Gamid-PG with common stochastic policy gradient methods on benchmark dense-reward MuJoCo tasks and sparse-reward Fetch tasks. Gamid-PG outperforms stochastic gradient-based methods on 3/6 MuJoCo tasks while performing comparably on the remaining 3 tasks; on the Fetch tasks, it outperforms single-actor deterministic gradient-based methods but performs worse than stochastic policy gradient methods. We conclude that GMMs optimized using deterministic policy gradients (1) should be favorably considered over stochastic gradients in dense-reward continuous control tasks, and (2) improve upon single-actor deterministic gradients.
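The variational entropy approximation referenced in the abstract is not spelled out on this page. As a point of reference, a standard pairwise-KL construction of this kind (in the style of Hershey & Olsen, 2007, obtained by applying Jensen's inequality per component) yields a differentiable upper bound on the mixture entropy; this is a sketch of one common instantiation, and the paper's exact estimator may differ:

$$\hat{H}(\pi) \;=\; \sum_i w_i\, H(\mathcal{N}_i) \;-\; \sum_i w_i \log \sum_j w_j\, e^{-\mathrm{KL}(\mathcal{N}_i \,\|\, \mathcal{N}_j)} \;\geq\; H(\pi),$$

where $w_i$ and $\mathcal{N}_i = \mathcal{N}(\mu_i, \Sigma_i)$ are the mixture weights and component Gaussians, $H(\mathcal{N}_i) = \tfrac{1}{2}\log\det(2\pi e\,\Sigma_i)$ is the Gaussian entropy, and the component-wise KL terms have closed forms, so $\hat{H}$ is differentiable in all mixture parameters and admits an entropy gradient.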
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Addressing reviewer tB8P's concerns, we add the following clarifying text. In terms of average performance, Gamid outperforms TD3 on 8/9 tasks (see Table 1). We also report an independent two-sample t-test (Cressie & Whitford, 1986) comparing Gamid and TD3, with the significance level set to 0.05. The results indicate that the advantage of Gamid over TD3 is statistically significant on 4/9 tasks ('HalfCheetah-v3', 'Walker2d-v3', 'FetchPush-v1', and 'FetchPickAndPlace-v1'); on the remaining tasks, the performance difference is not statistically significant.
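For readers who want to reproduce the significance analysis, the snippet below is a minimal sketch of an independent two-sample t-test using SciPy; the per-seed return arrays are illustrative placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

# Illustrative per-seed final returns for one task (placeholder values,
# not taken from the paper).
gamid_returns = np.array([5120.0, 4980.0, 5230.0, 5070.0, 5150.0])
td3_returns = np.array([4890.0, 4760.0, 5010.0, 4830.0, 4950.0])

# Independent two-sample t-test. equal_var=False selects Welch's variant,
# which does not assume equal variances across the two groups.
t_stat, p_value = stats.ttest_ind(gamid_returns, td3_returns, equal_var=False)

# Declare significance at the 0.05 level, as in the paper.
alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < alpha}")
```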
Assigned Action Editor: ~Xingyou_Song1
Submission Number: 3180