TL;DR: We propose learning-progress-based self-paced curriculum learning for MARL tasks, addressing issues that arise when existing return-based curriculum learning methods are applied to MARL.
Abstract: The number of agents can be an effective curriculum variable for controlling the difficulty of multi-agent reinforcement learning (MARL) tasks. Existing work typically uses manually defined curricula, such as linear schemes. We identify two potential flaws when applying existing reward-based automatic curriculum learning methods to MARL: (1) the expected episode return used to measure task difficulty has high variance; and (2) credit assignment difficulty can be exacerbated in tasks where increasing the number of agents yields higher returns, which is common in many MARL tasks. To address these issues, we propose to control the curriculum with a TD-error-based *learning progress* measure and to let the curriculum proceed from an initial context distribution to the final, task-specific one. Because our approach maintains a distribution over the number of agents and measures learning progress rather than absolute performance, which often increases with the number of agents, it alleviates problem (2). Moreover, the learning progress measure naturally alleviates problem (1) by aggregating returns. On three challenging sparse-reward MARL benchmarks, our approach outperforms state-of-the-art baselines.
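To make the idea concrete, below is a minimal, illustrative Python sketch of a learning-progress curriculum over the number of agents, as the abstract describes. It is not the authors' implementation (see the linked repository for that); the class name `TDErrorCurriculum`, the use of mean absolute TD error as the progress score, the softmax sampling, the EMA smoothing, and the linear annealing toward the target agent count are all assumptions made for illustration.

```python
import numpy as np


class TDErrorCurriculum:
    """Illustrative sketch: self-paced curriculum over the number of agents.

    Keeps a categorical distribution over candidate agent counts. Learning
    progress for each count is approximated by the mean absolute TD error
    collected while training at that count; sampling follows a softmax over
    these scores and is gradually annealed toward the target agent count.
    """

    def __init__(self, agent_counts, target_count, temperature=1.0, anneal_steps=10_000):
        self.agent_counts = list(agent_counts)
        self.target_idx = self.agent_counts.index(target_count)
        self.temperature = temperature
        self.anneal_steps = anneal_steps
        self.step = 0
        # Running learning-progress estimate (mean |TD error|) per agent count.
        self.progress = np.zeros(len(self.agent_counts))

    def update(self, agent_count, td_errors, ema=0.9):
        """Record TD errors observed while training with `agent_count` agents."""
        idx = self.agent_counts.index(agent_count)
        score = float(np.mean(np.abs(td_errors)))
        self.progress[idx] = ema * self.progress[idx] + (1.0 - ema) * score
        self.step += 1

    def sample(self, rng=np.random):
        """Sample the number of agents for the next batch of episodes."""
        # Softmax over learning-progress scores.
        logits = self.progress / self.temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Linearly interpolate toward the final, task-specific distribution.
        alpha = min(1.0, self.step / self.anneal_steps)
        target = np.zeros_like(probs)
        target[self.target_idx] = 1.0
        mixed = (1.0 - alpha) * probs + alpha * target
        return int(rng.choice(self.agent_counts, p=mixed))


if __name__ == "__main__":
    curriculum = TDErrorCurriculum(agent_counts=[2, 4, 6, 8], target_count=8)
    n_agents = curriculum.sample()
    # ... run episodes with n_agents and collect per-step TD errors ...
    fake_td_errors = np.random.randn(256)  # placeholder for collected TD errors
    curriculum.update(n_agents, fake_td_errors)
```

In this sketch, aggregating many TD errors into a single score per agent count is what smooths out the high variance of raw episode returns, and sampling by progress rather than return avoids always preferring larger teams just because they score higher.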
Lay Summary: Training teams of AI agents to work together is challenging, especially when they receive little feedback (reward) from the environment. Traditionally, researchers make the task easier at first — for example, by using fewer agents — and then slowly increase the difficulty. But this approach is often hand-crafted and doesn't always work well.
We explored whether an AI system could automatically decide how many agents to train with at each stage, based on how much it's learning. At first, we adapted an existing method that picks easier tasks by checking how much reward the agents get. But this method struggled: in multi-agent settings, high rewards don't always mean better learning due to the credit assignment problem.
To fix this, we proposed a new method that looks at how much the agents' policies improve over time — their "learning progress" — instead of just their rewards. This progress is measured using TD-error signals, which are more stable and informative.
Our experiments on several complex benchmarks show that our method helps agents learn faster and more effectively than previous approaches. This work could improve how AI teams are trained in environments where feedback is rare.
Link To Code: https://github.com/wenshuaizhao/spmarl
Primary Area: Reinforcement Learning->Multi-agent
Keywords: Curriculum learning, Temporal difference, Sparse reward
Submission Number: 1858