Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

Published: 23 Sept 2025, Last Modified: 01 Dec 2025
Venue: ARLET
License: CC BY 4.0
Track: Research Track
Keywords: policy parameterization, reparameterization, entropy regularization, actor-critic, policy optimization, exploration, continuous control, reinforcement learning
Abstract: Mixture policies in reinforcement learning offer greater flexibility than their base component policies. We demonstrate that, in theory, this flexibility enhances solution quality and improves robustness to the entropy scale. Despite these advantages, mixtures are rarely used in algorithms like Soft Actor-Critic, and the few available empirical studies do not demonstrate their effectiveness. One possible explanation is that base policies, such as Gaussian policies, admit a reparameterization that enables low-variance gradient updates, whereas mixtures do not. To address this, we introduce a marginalized reparameterization (MRP) estimator for mixture policies that has provably lower variance than the standard likelihood-ratio (LR) estimator. We conduct extensive experiments across a large suite of synthetic bandits and environments from classic control, Gym MuJoCo, the DeepMind Control Suite, MetaWorld, and MyoSuite. Our results show that mixture policies trained with our MRP estimator are more stable than those trained with the LR estimator and are competitive with Gaussian policies across many benchmarks. In addition, our approach shows benefits when the critic surface is multimodal and in environments with unshaped rewards.
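For intuition, the sketch below illustrates one plausible form of a marginalized-reparameterization-style objective for a Gaussian-mixture policy: the discrete component index is summed out analytically while each component's continuous sample is reparameterized, so no likelihood-ratio term is needed. The function and argument names (mrp_style_objective, critic, n_eps) and the exact form are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (an assumption, not the paper's code): estimate
# E_{a ~ mixture}[Q(s, a)] for a mixture-of-Gaussians policy by
# marginalizing the component index analytically and reparameterizing
# each component's Gaussian sample, so gradients flow through means,
# scales, and mixture weights pathwise (no likelihood-ratio term).
import torch


def mrp_style_objective(logits, means, log_stds, critic, state, n_eps=1):
    """logits:   (K,) unnormalized mixture weights
    means:    (K, act_dim) component means
    log_stds: (K, act_dim) component log standard deviations
    critic:   callable (state, action) -> scalar Q-value (hypothetical)
    n_eps:    number of shared noise draws used for the estimate
    """
    weights = torch.softmax(logits, dim=-1)  # (K,)
    stds = log_stds.exp()
    value = 0.0
    for _ in range(n_eps):
        eps = torch.randn_like(means)        # one noise draw per component
        actions = means + stds * eps         # reparameterized samples, (K, act_dim)
        q_vals = torch.stack([critic(state, a) for a in actions])  # (K,)
        # Analytic marginalization over the K components.
        value = value + (weights * q_vals).sum() / n_eps
    return value  # maximize: loss = -value; loss.backward() gives pathwise gradients
```

The trade-off in this sketch is K critic evaluations per state in exchange for never sampling the discrete component index, which is where the likelihood-ratio term and its variance would otherwise enter.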
Submission Number: 132