Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

ICLR 2026 Conference Submission 18013 Authors

Published: 19 Sept 2025 (last modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Multi-Agent Reinforcement Learning, Value Decomposition, Centralized Training with Decentralized Execution, Exploration
TL;DR: S2Q stores the values of multiple sub-actions, enabling the value function to adjust efficiently when its optimum shifts through exploration.
Abstract: Value decomposition has been extensively studied as a core approach to cooperative multi-agent reinforcement learning (MARL) under the centralized training with decentralized execution (CTDE) paradigm. Despite this progress, existing methods rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), a framework that successively learns multiple sub-value functions to retain information about alternative high-value actions. By incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly when the optimal action changes. Extensive experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms a range of MARL algorithms, demonstrating improved adaptability and overall performance.
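The abstract only outlines the mechanism, so the following is a minimal, hypothetical Python sketch of how a Softmax-based behavior policy could draw on multiple sub-value estimates for a single agent. The function name `select_action`, the element-wise max aggregation, and the `temperature` parameter are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch (assumptions, not the authors' implementation): a Softmax
# behavior policy built from K sub-value estimates per agent.
import numpy as np


def select_action(sub_q_values: np.ndarray, temperature: float = 1.0,
                  rng: np.random.Generator | None = None) -> int:
    """Sample an action from a Softmax over aggregated sub-value estimates.

    sub_q_values: array of shape (K, num_actions) holding K sub-value
        functions' estimates for one agent at the current observation.
    temperature: Softmax temperature; higher values mean more exploration.
    """
    rng = rng or np.random.default_rng()
    # Aggregate the K sub-value heads; an element-wise max (an assumption here)
    # retains information about alternative high-value actions from any head.
    aggregated = sub_q_values.max(axis=0)      # shape: (num_actions,)
    logits = aggregated / temperature
    logits -= logits.max()                     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))


# Usage example: 3 sub-value heads over 5 actions for one agent.
q = np.array([[1.0, 0.2, 0.5, 0.1, 0.3],
              [0.4, 1.1, 0.3, 0.2, 0.2],
              [0.3, 0.2, 0.9, 0.1, 0.4]])
action = select_action(q, temperature=0.5)
```

Because the Softmax spreads probability mass over every action with a high estimate in any sub-value head, actions that are currently suboptimal but were once high-value keep being revisited, which is the adaptability property the abstract describes.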
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18013