Keywords: emergence control; multi-agent reinforcement learning; conditional diffusion model; evolutionary strategy
TL;DR: This paper aims to learn the individual policies of 1000+ agents based on multi-agent reinforcement learning (MARL).
Abstract: Reward model learning methods are primarily divided into implicit reward modeling (IRM) and explicit reward modeling (ERM). IRM aims to learn the intrinsic reward of each agent to facilitate better exploration, while ERM aims to learn the behavioral preferences of agents. The biggest difference between the two is that ERM is transferable to other scenarios, whereas IRM is not. Currently, few methods can simultaneously achieve the global target and learn the ERM, yet the problem addressed in this paper requires integrating both objectives. This paper uses a diffusion model to generate the expert data for learning the ERM. Since a traditional diffusion model can only generate data that follows given expert data, we introduce an evolutionary diffusion model that generates data in the absence of any expert data. To steer the collaboration of all agents toward the specified macro-level objective, the macro-level objective is adopted as the fitness for the population. This mechanism transforms multi-agent exploration based on intrinsic reward into evolutionary exploration based on genetic operators. Moreover, the optimal individual retention strategy in the evolutionary diffusion model can address the non-stationarity problem in MARL. In the experiments, we demonstrate that MASDiff can simultaneously achieve the two objectives. Furthermore, we demonstrate the ability to conduct counterfactual reasoning with the transferable ERM in different scenarios: we pose several ‘What if’ questions that represent changes of scenario and obtain relatively accurate counterfactual reasoning results.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 10571