Keywords: Multi-agent Reinforcement Learning, Learning Awareness, Mixed-Motive Games
TL;DR: Hidden gifts (unobserved cooperation) break credit assignment in MARL; on the Manitokan task, state-of-the-art methods fail, while a decentralized, learning-aware policy-gradient correction with action history reduces variance and achieves collective success.
Abstract: Sometimes we benefit from actions that others have taken even when we are
unaware that they took those actions. For example, if your neighbor chooses not
to take a parking spot in front of your house when you are not there, you can
benefit, even without being aware that they took this action. These “hidden gifts”
represent an interesting challenge for multi-agent reinforcement learning (MARL),
since assigning credit when the beneficial actions of others are hidden is non-trivial.
Here, we study the impact of hidden gifts using a very simple MARL task. In this
task, agents in a grid-world environment must each unlock an individual door in order
to obtain an individual reward. In addition, if all of the agents unlock their doors, the group
receives a larger collective reward. However, there is only one key for all of the
doors, so the collective reward can only be obtained if agents drop the
key for others after using it. Notably, nothing indicates to an agent that
the other agents have dropped the key; the act of dropping the key for others is
thus a “hidden gift”.
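To make the setup concrete, here is a minimal, hypothetical sketch of the reward structure just described. The class name, reward values, and action strings are illustrative assumptions; the actual Manitokan environment dynamics and observations are not specified in this abstract.

```python
# Illustrative sketch of the reward structure described above. All names
# and numeric values are assumptions for illustration, not from the paper.

class HiddenGiftRewards:
    """One shared key, one door per agent, and a collective bonus."""

    def __init__(self, n_agents, door_reward=1.0, collective_reward=10.0):
        self.n_agents = n_agents
        self.door_reward = door_reward
        self.collective_reward = collective_reward
        self.unlocked = [False] * n_agents
        self.key_holder = 0  # index of the agent currently holding the key

    def step_rewards(self, agent, action):
        """Return per-agent rewards after `agent` takes `action`."""
        rewards = [0.0] * self.n_agents
        if action == "unlock" and self.key_holder == agent and not self.unlocked[agent]:
            self.unlocked[agent] = True
            rewards[agent] += self.door_reward  # individual reward
        elif action == "drop" and self.key_holder == agent:
            # The "hidden gift": nothing in any other agent's observation
            # changes here, yet the key becomes available to them.
            self.key_holder = None
        elif action == "pickup" and self.key_holder is None:
            self.key_holder = agent
        if all(self.unlocked):
            # Larger collective reward once every door is open
            # (in a real episode this would presumably end the episode).
            rewards = [r + self.collective_reward for r in rewards]
        return rewards
```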
We show that several state-of-the-art MARL algorithms,
including MARL-specific architectures, fail to learn how to obtain the collective
reward in this simple task. Interestingly, we find that decentralized actor-critic
policy-gradient agents can solve the task when we provide them with information
about their own action history, whereas the MARL agents still cannot solve it even with
action history.
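As one purely illustrative reading of “providing agents with their own action history”, the sketch below conditions a decentralized actor-critic on a one-hot encoding of the agent's recent actions. The class name, network sizes, and history length are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ActionHistoryActorCritic(nn.Module):
    """Decentralized actor-critic whose input is augmented with the agent's
    own recent actions (one-hot), as a sketch of action-history conditioning.
    Sizes and the history length are illustrative assumptions."""

    def __init__(self, obs_dim, n_actions, history_len=4, hidden=64):
        super().__init__()
        in_dim = obs_dim + history_len * n_actions
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: state value

    def forward(self, obs, action_history_onehot):
        # action_history_onehot: (batch, history_len * n_actions)
        x = torch.cat([obs, action_history_onehot], dim=-1)
        h = self.body(x)
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```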
Finally, we derive a correction term for these policy-gradient agents,
inspired by learning-aware approaches, which reduces variance during learning and
helps them converge to collective success more reliably. These results show
that credit assignment in multi-agent settings can be particularly challenging in
the presence of “hidden gifts”, and demonstrate that awareness of their own learning
can benefit decentralized agents in these settings.
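The abstract does not give the form of the derived correction term. Purely as an illustration of the general shape of a learning-aware, variance-reducing adjustment to a policy-gradient loss, here is a hypothetical sketch; `learning_aware_pg_loss`, the penalty form, and `beta` are all assumptions and should not be read as the paper's actual derivation.

```python
import torch

def learning_aware_pg_loss(logits, actions, returns, old_logits, beta=0.1):
    """Hypothetical sketch of a learning-aware policy-gradient correction.

    The update accounts for how the agent's own policy has shifted
    (old_logits -> logits) on the actions it took, scaled by return
    magnitude, as a variance-reduction heuristic standing in for the
    paper's derived term. `beta` is an illustrative coefficient.
    """
    logp = torch.log_softmax(logits, dim=-1)
    old_logp = torch.log_softmax(old_logits, dim=-1).detach()
    logp_a = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    old_logp_a = old_logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    pg_term = -(logp_a * returns).mean()  # vanilla REINFORCE objective
    # Correction: penalize disagreement with the pre-update policy on the
    # taken actions, weighted by how consequential those actions were.
    correction = beta * ((logp_a - old_logp_a).pow(2) * returns.abs()).mean()
    return pg_term + correction
```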
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 14753