Abstract: Learning multimodal policies is crucial for enhancing exploration in online reinforcement learning (RL), especially in tasks with continuous action spaces and non-convex reward landscapes. While recent diffusion policies show promise, they often suffer from low computational efficiency in online settings. A more training-efficient paradigm models the policy as a Boltzmann distribution and guides action sampling directly with the gradient of the Q-value with respect to the action (proportional to the score function of the policy), for example via Langevin dynamics. However, the analysis in this paper reveals that this gradient-guided approach suffers from two critical challenges: sampling instability caused by the widely varying magnitudes of action gradients, and mode imbalance, where the sampling process inaccurately represents the weights of different high-value action modes. To address these challenges, this paper introduces three targeted techniques: score normalization and score reshaping to stabilize the sampling process, and value-based resampling to correct mode imbalance. These techniques are integrated into an actor-critic framework, resulting in the Score-Enhanced Actor-Critic (SEAC) algorithm. Simulation and real-world experiments demonstrate that SEAC not only effectively learns multimodal behaviors but also achieves state-of-the-art performance and high computational efficiency compared to prior multimodal RL methods. The code for this paper is available at https://github.com/THUzouwenjun/SEAC.
DOI: 10.1109/tase.2025.3631883
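To make the gradient-guided sampling idea concrete, the sketch below illustrates Langevin-dynamics action sampling driven by the Q-value gradient, with a simple gradient-normalization step and softmax value-based resampling in the spirit of the abstract. It is not the authors' implementation: the Q-network interface `q_net(state, action)`, the step sizes, the noise scale, and the clipping bounds are all illustrative assumptions.

```python
# Minimal sketch (assumed interfaces, not the SEAC code) of Q-gradient-guided
# Langevin sampling with normalized scores, plus value-based resampling.
import torch


def langevin_sample_actions(q_net, state, action_dim, num_steps=20,
                            step_size=0.1, noise_scale=0.1,
                            action_low=-1.0, action_high=1.0):
    """Refine randomly initialized actions toward high-Q modes via Langevin dynamics."""
    batch = state.shape[0]
    a = torch.empty(batch, action_dim).uniform_(action_low, action_high)

    for _ in range(num_steps):
        a = a.detach().requires_grad_(True)
        q = q_net(state, a).sum()
        grad = torch.autograd.grad(q, a)[0]  # dQ/da, proportional to the policy score
        # Normalize the widely varying gradient magnitudes (score-normalization idea).
        grad = grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
        noise = noise_scale * torch.randn_like(a)
        a = (a + step_size * grad + noise).clamp(action_low, action_high)

    return a.detach()


def value_based_resample(q_net, state, candidates, temperature=1.0):
    """Resample candidate actions (for a single state) in proportion to softmax(Q / T),
    correcting imbalance among high-value modes. `state` is assumed shaped (1, state_dim),
    `candidates` shaped (num_candidates, action_dim)."""
    with torch.no_grad():
        q = q_net(state.expand(candidates.shape[0], -1), candidates).squeeze(-1)
    probs = torch.softmax(q / temperature, dim=0)
    idx = torch.multinomial(probs, num_samples=candidates.shape[0], replacement=True)
    return candidates[idx]
```

In this hypothetical setup, the Langevin loop plays the role of sampling from a Boltzmann-like policy whose score is the Q-gradient, while the resampling step reweights the resulting candidates by value; SEAC's actual normalization, reshaping, and resampling rules are specified in the paper and repository linked above.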