Using Assistance Rewards Without Introducing Bias: Overcoming Sparse Rewards in Multi-Agent Reinforcement Learning
Abstract: Reinforcement learning agents may fail to learn good policies when their reward function is too sparse.
Auxiliary reward-shaping functions can help guide exploration toward the true rewards, but they risk producing sub-optimal policies because agents then optimize a modified objective.
Our paper addresses this challenge with a general framework for incorporating auxiliary reward functions without biasing the true objective.
Agents train an ensemble of reward-function-specific policies, sharing experience collected under one policy with all other policies in the ensemble.
A top-level control policy then learns to choose the best policy to maximize the true objective.
We show that this scheme preserves the convergence properties of the underlying reinforcement learning algorithm while avoiding any bias in the agent's objective.
We also adapt our proposed algorithm to off-policy PPO with MA-Trace correction for state-value estimation. To our knowledge, this is the first work to adapt off-policy PPO to a multi-agent setting. We further demonstrate that our approach operates effectively with a variety of assistance-reward designs, removing the need for detailed reward-function crafting or fine-tuning.
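To make the scheme described above concrete, the following is a minimal toy sketch: each ensemble member is trained against its own reward signal (the sparse true reward or a dense assistance reward) from shared experience, and a top-level controller selects which member acts, judged only by the true return. The chain environment, tabular Q-learning members, and bandit-style controller are illustrative assumptions for exposition, not the paper's actual algorithm (which builds on off-policy PPO).

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, GOAL = 12, 2, 11      # toy chain MDP (illustrative assumption)

def env_step(state, action):
    """Chain environment: action 1 moves right, action 0 moves left."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    true_r = 1.0 if next_state == GOAL else 0.0     # sparse true reward at the goal
    aux_r = 0.1 * (next_state - state)              # dense assistance (shaping) reward
    return next_state, true_r, aux_r, next_state == GOAL

# Ensemble of reward-specific members: member 0 learns from the true reward,
# member 1 from the assistance reward. Both learn from every shared transition.
Q = np.zeros((2, N_STATES, N_ACTIONS))
controller_value = np.zeros(2)       # top-level estimate of each member's true return
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

for episode in range(500):
    # Top-level controller picks which member acts (noisy greedy selection).
    member = int(np.argmax(controller_value + rng.normal(0.0, 0.05, size=2)))
    state, ep_true_return = 0, 0.0
    for _ in range(300):                             # step cap so episodes terminate
        if rng.random() < EPS:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[member, state]))
        next_state, true_r, aux_r, done = env_step(state, action)
        # Experience sharing: every member updates off-policy from this transition,
        # each with its own reward signal.
        for k, r in enumerate((true_r, aux_r)):
            target = r + (0.0 if done else GAMMA * Q[k, next_state].max())
            Q[k, state, action] += ALPHA * (target - Q[k, state, action])
        ep_true_return += true_r
        state = next_state
        if done:
            break
    # The controller is updated only with the unbiased true return,
    # so the overall objective stays the true one.
    controller_value[member] += 0.1 * (ep_true_return - controller_value[member])

print("controller's estimated true return per member:", controller_value)
```

In this sketch, the member trained on the assistance reward discovers the goal quickly, its trajectories flow into the shared experience, and the controller (and eventually the true-reward member) benefits without the shaped reward ever entering the top-level objective.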
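For context on the off-policy correction mentioned above: MA-Trace builds on a V-trace-style target (Espeholt et al., 2018), which corrects value estimates computed from behaviour-policy data $\mu$ toward the target policy $\pi$ roughly as

$$
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{\,t-s}\Big(\prod_{i=s}^{t-1} c_i\Big)\,\delta_t V,
\qquad
\delta_t V = \rho_t\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big),
$$

with truncated importance ratios $\rho_t = \min\!\big(\bar\rho,\ \pi(a_t \mid x_t)/\mu(a_t \mid x_t)\big)$ and $c_i = \min\!\big(\bar c,\ \pi(a_i \mid x_i)/\mu(a_i \mid x_i)\big)$. In the multi-agent setting, MA-Trace applies an analogous correction with a centralized critic; the exact form used here is given in the paper.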