O-MAPL: Offline Multi-agent Preference Learning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Offline Multi-agent Preference Learning
Abstract: Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi-agent RL (MARL). The large joint state-action spaces and intricate inter-agent interactions in MARL make inferring the joint reward function especially challenging. While prior studies in single-agent settings have explored ways to recover reward functions and expert policies from human preference feedback, such studies in MARL remain limited. Existing methods typically combine two separate stages, supervised reward learning and standard MARL training, which leads to unstable optimization. In this work, we exploit the inherent connection between reward functions and Q-functions in cooperative MARL to introduce a novel end-to-end preference-based learning framework. Our framework is supported by a carefully designed multi-agent value decomposition strategy that enhances training efficiency. Extensive experiments on two state-of-the-art benchmarks, SMAC and MAMuJoCo, using preference data generated by both rule-based and large language model (LLM)-based approaches, demonstrate that our algorithm consistently outperforms existing methods across a range of tasks.
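To make the abstract's central idea concrete, the sketch below illustrates one way a reward–Q connection can turn preference learning into an end-to-end objective. It is a minimal, hypothetical PyTorch illustration, not the paper's actual implementation: it assumes discrete actions, a soft value V(s) = α·logsumexp(Q(s,·)/α), the inverse soft-Q identity r_t = Q(s_t, a_t) − γV(s_{t+1}) to express rewards through Q, a simple additive (VDN-style) decomposition standing in for the paper's more elaborate mixing strategy, and a Bradley-Terry model over segment returns. All class names, shapes, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: end-to-end preference-based Q-learning for cooperative MARL.
# The architecture and decomposition here are simplified stand-ins, not O-MAPL.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQ(nn.Module):
    """Per-agent Q-network over local observations (discrete actions)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, T, obs_dim) -> Q-values: (batch, T, n_actions)
        return self.net(obs)

def segment_return(q_nets, obs, actions, alpha: float = 1.0, gamma: float = 0.99):
    """Implicit return of a trajectory segment, expressed through Q only.

    obs:     (batch, T+1, n_agents, obs_dim)
    actions: (batch, T,   n_agents) integer (long) actions
    """
    # Per-agent Q over all actions at every step.
    q_all = torch.stack(
        [net(obs[:, :, i]) for i, net in enumerate(q_nets)], dim=2
    )  # (batch, T+1, n_agents, n_actions)

    # Q of the taken actions, and soft value V = alpha * logsumexp(Q / alpha).
    q_taken = q_all[:, :-1].gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    v_soft = alpha * torch.logsumexp(q_all / alpha, dim=-1)

    # Simple additive (VDN-style) decomposition into joint quantities;
    # the paper uses a carefully designed mixing strategy instead.
    q_joint = q_taken.sum(dim=-1)            # (batch, T)
    v_joint_next = v_soft[:, 1:].sum(dim=-1)  # (batch, T)

    # Inverse soft-Q identity: r_t = Q(s_t, a_t) - gamma * V(s_{t+1}).
    rewards = q_joint - gamma * v_joint_next
    return rewards.sum(dim=-1)               # (batch,)

def preference_loss(q_nets, seg1, seg2, prefs):
    """Bradley-Terry: P(seg1 > seg2) = sigmoid(R(seg1) - R(seg2))."""
    logits = segment_return(q_nets, *seg1) - segment_return(q_nets, *seg2)
    return F.binary_cross_entropy_with_logits(logits, prefs)

# Usage with toy shapes: 2 agents, 4 actions, segments of length T = 10.
q_nets = [AgentQ(obs_dim=8, n_actions=4) for _ in range(2)]
seg = lambda: (torch.randn(32, 11, 2, 8), torch.randint(0, 4, (32, 10, 2)))
loss = preference_loss(q_nets, seg(), seg(), torch.rand(32))
loss.backward()  # gradients flow end-to-end into the Q-networks
```

Because the preference loss backpropagates directly into the Q-networks, there is no separately trained reward model; this is the single-stage, end-to-end property the abstract contrasts with two-stage methods.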
Lay Summary: Teaching AI teams to cooperate is hard if we can't perfectly define a scoring system for "good" behavior. An alternative is "preference learning": simply showing the AI two attempts and indicating which was better. Our new method, O-MAPL, enables AI teams to learn directly from a pre-existing dataset of such "better/worse" examples. Unlike many previous approaches that first try to build a scoring system from these preferences before training the team, O-MAPL skips this potentially unstable intermediate step. This direct approach leads to more stable and efficient learning, and includes a specialized technique for managing team coordination. When tested in complex simulations (StarCraft battles and multi-agent robotics), O-MAPL helped AI teams learn to cooperate and perform more successfully than other methods.
Primary Area: Reinforcement Learning->Multi-agent
Keywords: Multi-agent Reinforcement Learning, Preference Learning
Submission Number: 15169