Keywords: Value Decomposition, Multi-Agent Reinforcement Learning
TL;DR: We propose a novel value factorisation method to deal with non-monotonic and stochastic target joint action-values.
Abstract: Extracting decentralised policies from joint action-values is an attractive way to exploit centralised learning. Monotonic value factorisation can be applied to guarantee consistency between the centralised and decentralised policies. However, it remains unclear how best to train decentralised policies when the target joint action-values are non-monotonic and stochastic. We propose a novel value factorisation method named uncertainty-based target shaping (UTS) to solve this problem. UTS employs networks that predict the reward and the next state's embedding, where a large prediction error indicates that the target is stochastic. By replacing the deterministic targets of suboptimal joint actions with the best per-agent values, we ensure that all shaped targets lie within the space representable by monotonic value factorisation. Empirical results show that UTS outperforms state-of-the-art baselines on multiple benchmarks, including matrix games, predator-prey, and challenging StarCraft II micromanagement tasks.
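For intuition, here is a minimal sketch of the target-shaping mechanism the abstract describes. All names (`PredictionNets`, `shape_targets`), network shapes, and the `threshold` value are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of uncertainty-based target shaping (UTS), as described
# in the abstract. Architectures and the threshold are assumptions.
import torch
import torch.nn as nn

class PredictionNets(nn.Module):
    """Predict the reward and an embedding of the next state from the
    current state and joint action (hypothetical architecture)."""
    def __init__(self, state_dim, joint_action_dim, embed_dim=32):
        super().__init__()
        in_dim = state_dim + joint_action_dim
        self.reward_net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.embed_net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
        # Encodes the observed next state into the same embedding space.
        self.target_embed = nn.Linear(state_dim, embed_dim)

    def prediction_error(self, state, joint_action, reward, next_state):
        # Large error suggests the target joint action-value is stochastic.
        x = torch.cat([state, joint_action], dim=-1)
        r_err = (self.reward_net(x).squeeze(-1) - reward) ** 2
        e_err = ((self.embed_net(x) - self.target_embed(next_state)) ** 2).mean(-1)
        return r_err + e_err

def shape_targets(y, agent_qs, pred_err, threshold=0.1):
    """Replace deterministic suboptimal targets with the sum of the best
    per-agent values, so the shaped targets fall within the space a
    monotonic mixing network can represent.
    y:        (batch,) target joint action-values
    agent_qs: (batch, n_agents, n_actions) per-agent utilities
    pred_err: (batch,) prediction error; large error => stochastic target
    """
    best_per_agent = agent_qs.max(dim=-1).values.sum(dim=-1)  # (batch,)
    deterministic = pred_err < threshold
    suboptimal = y < best_per_agent
    return torch.where(deterministic & suboptimal, best_per_agent, y)
```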
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)