One Policy Learns Them All: Synergizing Prior-Guided Exploitation and Online Exploration in Curriculum Based MARL
Keywords: policy prior, multi-agent reinforcement learning, sample efficiency, dual-track curriculum learning
Abstract: Existing offline-to-online (O2O) multi-agent reinforcement learning (MARL) methods typically employ offline prior policies for warm-start initialization, but remain susceptible to distributional shift and are constrained by the need for structural consistency between the offline and online policies.
On the other hand, prior-guided cold-start conditions, albeit of greater practical interest, require a subtle synergy between exploiting prior-collected samples and self-exploring the state-action space.
In this paper, we propose DUCE, a dual-track curriculum MARL algorithm that balances exploitation and exploration to ensure efficient, stable cold-start training.
The curriculum designs include:
(1) an externally configured task-difficulty curriculum that alternates between executing the prior and online policies under probabilistic scheduling, progressively shrinking the prior-guidance horizon so that tasks transition from easy to hard, and
(2) an internally evolving policy optimization curriculum that imposes a decaying offline RL regularizer on the online loss, enabling a smooth shift from conservative prior reliance to exploration-driven training.
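The two curricula above can be illustrated with a minimal sketch. All function names, decay shapes (linear), and default hyperparameters (`p0`, `h0`, `lam0`) below are hypothetical illustrations, not the paper's actual schedules:

```python
def prior_execution_prob(step, total_steps, p0=1.0):
    # Track 1 (external): probability of rolling out the prior policy,
    # assumed here to decay linearly to zero (fully online by the end).
    return max(0.0, p0 * (1.0 - step / total_steps))

def guidance_horizon(step, total_steps, h0=20):
    # Track 1 (external): number of initial timesteps driven by the prior,
    # shrinking from h0 to 0 so tasks progress from easy to hard.
    return int(round(h0 * max(0.0, 1.0 - step / total_steps)))

def curriculum_loss(online_loss, offline_reg, step, total_steps, lam0=1.0):
    # Track 2 (internal): a decaying offline-RL regularizer weight lam
    # shifts training from conservative prior reliance to exploration.
    lam = lam0 * max(0.0, 1.0 - step / total_steps)
    return online_loss + lam * offline_reg
```

The key design point the sketch captures is that both decays are monotone in training progress, so the two tracks hand control from the prior to the online policy in lockstep.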
Extensive experiments on challenging StarCraft Multi-Agent Challenge (SMAC) v1/v2 tasks demonstrate that DUCE achieves faster convergence and higher asymptotic performance, consistently outperforming state-of-the-art warm-start baselines.
Importantly, DUCE is agnostic to the architecture of the prior (e.g., rule-based or RNN-based).
Primary Area: reinforcement learning
Submission Number: 9385