Boosting Offline MARL under Imbalanced Datasets via Compositional Diffusion Models
Keywords: Multi-agent systems, Reinforcement learning, Diffusion models
Abstract: Offline multi-agent reinforcement learning (MARL) is hampered by agent-quality imbalance in datasets, where the entanglement of expert and suboptimal behaviors from heterogeneous behavior policies inhibits effective policy learning. Conventional offline MARL methods overfit to these suboptimal behaviors, leading to significant performance degradation. A promising solution is data augmentation with generative models such as diffusion models, which can generate balanced, high-quality trajectories to enrich the dataset. However, existing methods usually adopt a standard diffusion process, conditioning generation solely on team-level signals such as global return. This coarse guidance lacks active, fine-grained, agent-level control, limiting the diffusion model's ability to produce high-quality cooperative behaviors that generalize beyond the dataset. To address this, we propose Compositional Diffusion for Imbalanced Datasets (CODI), a novel framework that leverages large language models (LLMs) and diffusion models to generate balanced, high-quality trajectories. CODI first distills an agent-quality labeler from an LLM to annotate the dataset. It then employs a conditional diffusion model that generates trajectory segments conditioned not only on return-to-go but also on fine-grained agent-quality labels. Crucially, to effectively compose scattered high-quality behaviors and enable generalization, CODI decomposes the target team quality into in-distribution agent-level labels for compositional diffusion generation. The generated segments are then stitched into complete trajectories, augmenting the dataset. Extensive evaluation on challenging imbalanced datasets, where only a single agent is an expert, shows that CODI successfully mitigates data imbalance and facilitates the learning of strong cooperative policies, recovering 63% of the performance achieved with a balanced expert dataset and substantially outperforming baseline methods.
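To make the conditioning scheme described in the abstract concrete, the sketch below shows a denoiser conditioned on a return-to-go scalar plus per-agent quality labels, and a compositional sampling loop that sweeps in-distribution single-expert label assignments. This is a minimal illustration, not the authors' implementation: all module names, shapes, the beta schedule, and the label encoding (1.0 = expert, 0.0 = suboptimal) are assumptions.

```python
import torch
import torch.nn as nn

N_AGENTS, SEG_LEN, OBS_ACT_DIM, HIDDEN = 3, 16, 32, 256  # assumed sizes

class ConditionalDenoiser(nn.Module):
    """Predicts the noise on a flattened trajectory segment, given the
    diffusion timestep, a return-to-go scalar, and one quality label per agent."""
    def __init__(self):
        super().__init__()
        in_dim = N_AGENTS * SEG_LEN * OBS_ACT_DIM
        cond_dim = 1 + N_AGENTS  # return-to-go + agent-level quality labels
        self.net = nn.Sequential(
            nn.Linear(in_dim + cond_dim + 1, HIDDEN), nn.SiLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.SiLU(),
            nn.Linear(HIDDEN, in_dim),
        )

    def forward(self, x_t, t, rtg, agent_quality):
        # x_t: (B, in_dim) noisy segment; t: (B, 1) normalized timestep;
        # rtg: (B, 1); agent_quality: (B, N_AGENTS).
        cond = torch.cat([t, rtg, agent_quality], dim=-1)
        return self.net(torch.cat([x_t, cond], dim=-1))

@torch.no_grad()
def sample_segment(model, rtg, agent_quality, steps=50):
    """DDPM-style ancestral sampling with a linear beta schedule (illustrative)."""
    betas = torch.linspace(1e-4, 2e-2, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, N_AGENTS * SEG_LEN * OBS_ACT_DIM)
    for i in reversed(range(steps)):
        t = torch.full((1, 1), i / steps)
        eps = model(x, t, rtg, agent_quality)
        # Posterior mean under the standard DDPM parameterization.
        x = (x - betas[i] / (1 - alpha_bars[i]).sqrt() * eps) / alphas[i].sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x.view(N_AGENTS, SEG_LEN, OBS_ACT_DIM)

# Compositional generation: decompose a high team-quality target into
# in-distribution per-agent labels (here, one expert at a time, matching the
# single-expert datasets the paper evaluates on), then stitch the segments.
model = ConditionalDenoiser()
rtg = torch.tensor([[0.9]])                       # assumed normalized return-to-go
for expert_idx in range(N_AGENTS):
    labels = torch.zeros(1, N_AGENTS)
    labels[0, expert_idx] = 1.0                   # in-distribution: single expert
    segment = sample_segment(model, rtg, labels)  # (N_AGENTS, SEG_LEN, OBS_ACT_DIM)
```

In this reading, the decomposition step keeps each conditioning vector inside the label combinations actually present in the imbalanced dataset, so the model is never queried on an out-of-distribution "all experts" label it was never trained on.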
Area: Learning and Adaptation (LEARN)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 574