One Policy Learns Them All: Synergizing Prior-Guided Exploitation and Online Exploration in Curriculum Based MARL

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: policy prior, multi-agent reinforcement learning, sample efficiency, dual-track curriculum learning
Abstract: Existing offline-to-online (O2O) multi-agent reinforcement learning (MARL) methods typically employ offline prior policies for warm-start initialization, but remain susceptible to distributional shift and structural-consistency constraints. On the other hand, prior-guided cold-start settings, albeit of more practical interest, require a subtle synergy between exploiting prior-collected samples and self-exploring the state-action space. In this paper, we propose DUCE, a dual-track curriculum MARL algorithm that balances exploitation and exploration to ensure efficient, stable cold-start training. The two curricula are: (1) an externally configured task-difficulty curriculum that alternates between executing the prior and online policies under probabilistic scheduling, progressively shrinking the prior-guidance horizon so that tasks transition from easy to hard, and (2) an internally evolving policy-optimization curriculum that imposes a decaying offline RL regularizer on the online loss, enabling a smooth shift from conservative reliance on the prior to exploration-driven training. Extensive experiments on challenging StarCraft Multi-Agent Challenge (SMAC) v1/v2 tasks demonstrate that DUCE achieves faster convergence and higher asymptotic performance, consistently outperforming state-of-the-art warm-start baselines. Importantly, DUCE is agnostic to the architecture of the prior (e.g., rule-based or RNN).
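The dual-track schedule described in the abstract can be sketched as follows. This is a minimal illustration assuming a per-episode geometric decay of both the prior-guidance horizon and the offline-regularizer weight; all names (`p_prior`, `horizon_decay`, `beta0`, etc.) are hypothetical and not taken from the paper.

```python
import random

def duce_schedule_sketch(num_episodes, episode_len, p_prior=0.9,
                         horizon_decay=0.95, beta0=1.0, reg_decay=0.99):
    """Illustrative sketch (not the paper's implementation) of DUCE's
    two curricula:
      - track 1: with probability p_prior, the prior policy acts for the
        first `horizon` steps of the episode; the horizon shrinks each
        episode, moving tasks from easy to hard.
      - track 2: the offline-RL regularizer weight `beta` on the online
        loss decays, shifting from conservative to exploration-driven.
    Returns one (horizon, beta, use_prior) tuple per episode."""
    horizon = float(episode_len)   # prior-guidance horizon, in env steps
    beta = beta0                   # offline-regularizer weight in the loss
    schedule = []
    for _ in range(num_episodes):
        use_prior = random.random() < p_prior  # probabilistic scheduling
        schedule.append((int(round(horizon)), beta, use_prior))
        horizon *= horizon_decay   # progressively reduce prior guidance
        beta *= reg_decay          # decay the offline regularizer
    return schedule
```

In a training loop, each tuple would determine how the episode is rolled out (prior policy for the first `horizon` steps, online policy thereafter) and how the loss is weighted, e.g. `loss = online_loss + beta * offline_reg`.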
Primary Area: reinforcement learning
Submission Number: 9385