MOSDT: Self-Distillation-Based Decision Transformer for Multi-Agent Offline Safe Reinforcement Learning
Keywords: multi-agent offline safe reinforcement learning, decision transformer, self-distillation
TL;DR: We propose MOSDT, the first algorithm for multi-agent offline safe reinforcement learning (MOSRL), along with MOSDB, the first dataset and benchmark for this setting.
Abstract: We introduce MOSDT, the first algorithm designed for multi-agent offline safe reinforcement learning (MOSRL), alongside MOSDB, the first dataset and benchmark for this domain. Unlike most existing knowledge distillation-based multi-agent RL methods, we propose policy self-distillation (PSD) with a new global information reconstruction scheme that fuses the observation features of all agents, streamlining training and improving parameter efficiency. We adopt full parameter sharing across agents, which substantially reduces the parameter count and, by stabilizing training, boosts returns by up to 38.4-fold. We further propose a plug-and-play cost binary embedding (CBE) module, which binarizes cumulative costs into safety signals and embeds them into the return features for efficient information aggregation. On the MOSDB benchmark, MOSDT achieves state-of-the-art (SOTA) returns on 14 out of 18 tasks (spanning all base environments, including MuJoCo, Safety Gym, and Isaac Gym) while maintaining complete safety, and uses only 65% of the execution parameter count of CDT, a SOTA single-agent offline safe RL method. Code, dataset, and results are available at https://github.com/Lucian1115/MOSDT.git
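To make the cost binary embedding idea concrete, here is a minimal sketch of how such a module might look. It is an assumption-based illustration, not the paper's implementation: the class name `CostBinaryEmbedding`, the threshold-against-budget binarization, and the additive fusion with return features are all hypothetical choices introduced for clarity.

```python
# Minimal sketch of a cost-binary-embedding-style module (illustrative only).
# Assumptions: cumulative cost is compared against a scalar budget, the resulting
# binary safety signal indexes a learnable embedding, and that embedding is added
# to the return-to-go token features. The actual CBE module may differ.
import torch
import torch.nn as nn


class CostBinaryEmbedding(nn.Module):
    """Binarize cumulative cost against a budget and embed the safety signal."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Two learnable embeddings: index 0 = "within budget", index 1 = "budget exceeded".
        self.safety_embed = nn.Embedding(2, embed_dim)

    def forward(self, cumulative_cost: torch.Tensor, cost_budget: float,
                return_emb: torch.Tensor) -> torch.Tensor:
        # cumulative_cost: (batch, seq_len); return_emb: (batch, seq_len, embed_dim)
        safety_signal = (cumulative_cost > cost_budget).long()  # binarized cost signal
        # Fuse the safety embedding into the return features (additive fusion assumed).
        return return_emb + self.safety_embed(safety_signal)


if __name__ == "__main__":
    cbe = CostBinaryEmbedding(embed_dim=128)
    costs = torch.tensor([[0.0, 3.0, 12.0]])   # cumulative costs per timestep
    returns = torch.randn(1, 3, 128)           # return-to-go token features
    fused = cbe(costs, cost_budget=10.0, return_emb=returns)  # shape (1, 3, 128)
```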
Supplementary Material: zip
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 4775