Revealing Stochastic Noise of Policy Entropy in Reinforcement Learning

ICLR 2026 Conference Submission 16295 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multi-agent reinforcement learning, policy optimization, decentralized partially observable Markov decision process
Abstract: On-policy MARL remains attractive under non-stationarity but typically relies on a fixed entropy bonus that conflates useful exploration with stochastic fluctuations from concurrently learning agents. We present \emph{Policy Entropy Manipulation (PEM)}, a simple, drop-in alternative that treats entropy as a noisy measurement to be \emph{denoised} rather than uniformly maximized. PEM applies positive--negative momentum to the \emph{entropy gradient} to form a high-pass, variance-controlled signal that preserves persistent exploratory trends while suppressing transient spikes. The method integrates seamlessly with heterogeneous, permutation-based on-policy algorithms, requires no changes to the critic or advantage estimation, and uses a one-step warm start that recovers the standard entropy objective before momentum takes effect. Across benchmark tasks, PEM yields smoother and more stable learning curves and improves coordination under heterogeneous observation/action spaces, resulting in stronger generalization than conventional entropy-regularized baselines. Our results indicate that noise-aware control of the entropy channel is an effective and principled way to stabilize exploration in cooperative MARL.
Primary Area: reinforcement learning
Submission Number: 16295
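As a reading aid for the mechanism described in the abstract above (positive--negative momentum on the entropy gradient, with a one-step warm start), the following is only a minimal sketch of that kind of filter. The paper's code is not part of this page, so the class name `PNMEntropyFilter`, the alternating-buffer scheme, and the coefficients `beta` and `nu` are all illustrative assumptions, not the authors' implementation.

```python
import torch


class PNMEntropyFilter:
    """Minimal sketch of a positive-negative momentum filter on the
    entropy-gradient channel (assumed design, not the paper's code)."""

    def __init__(self, beta: float = 0.9, nu: float = 1.0):
        # beta: EMA decay shared by the two alternating momentum buffers (assumed).
        # nu:   weight of the negative momentum term (assumed).
        self.beta = beta
        self.nu = nu
        self.m_odd = None   # buffer updated on odd steps
        self.m_even = None  # buffer updated on even steps
        self.step = 0

    def __call__(self, entropy_grad: torch.Tensor) -> torch.Tensor:
        self.step += 1
        if self.step == 1:
            # One-step warm start: pass the raw entropy gradient through,
            # i.e. the standard entropy objective before momentum takes effect.
            self.m_odd = entropy_grad.clone()
            self.m_even = torch.zeros_like(entropy_grad)
            return entropy_grad
        # Alternate which buffer receives the new entropy gradient.
        if self.step % 2 == 1:
            self.m_odd = self.beta * self.m_odd + (1 - self.beta) * entropy_grad
        else:
            self.m_even = self.beta * self.m_even + (1 - self.beta) * entropy_grad
        # Positive-negative combination: the freshly updated buffer is weighted
        # positively, the stale one negatively, which acts as a high-pass,
        # variance-controlled filter on the entropy signal.
        recent, stale = (
            (self.m_odd, self.m_even) if self.step % 2 == 1 else (self.m_even, self.m_odd)
        )
        return (1 + self.nu) * recent - self.nu * stale
```

In use, such a filter would simply replace the raw entropy-gradient term in an on-policy update (the policy-gradient and critic updates stay untouched); whether PEM combines the buffers exactly this way is not stated on this page.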