Keywords: Multi-party Multi-objective Reinforcement Learning; Constrained Reinforcement Learning; Multi-objective Reinforcement Learning
TL;DR: The paper proposes a multi-party negotiation framework for safe multi-objective reinforcement learning, allowing for a Pareto front of policies that balance efficiency and safety constraints.
Abstract: Safe multi-objective reinforcement learning (Safe MORL) seeks to optimize performance while satisfying safety constraints. Existing methods face two key challenges: (i) incorporating safety as additional objectives enlarges the objective space, requiring more solutions to cover the Pareto front uniformly and maintain adaptability under changing preferences; (ii) strictly enforcing safety constraints is feasible for a single constraint or for compatible constraints, but conflicting constraints prevent flexible, preference-aware trade-offs.
To address these challenges, we cast Safe MORL within a multi-party negotiation framework that treats safety as an external regulatory perspective, enabling the search for a consensus-based multi-party Pareto-optimal set. We propose a multi-party Pareto negotiation (MPPN) strategy built on NSGA-II, which employs a negotiation threshold $\varepsilon$ to represent the acceptable solution range for each party. During evolutionary search, $\varepsilon$ is dynamically adjusted to maintain a sufficiently large negotiated solution set, progressively steering the population toward the $(\varepsilon_{\text{efficiency}}, \varepsilon_{\text{safety}})$-negotiated common Pareto set.
The framework preserves user preferences over conflicting safety constraints without introducing additional objectives and flexibly adapts to emergent scenarios through progressively guided $(\varepsilon_{\text{efficiency}}, \varepsilon_{\text{safety}})$. Experiments on a MuJoCo benchmark show that our approach outperforms state-of-the-art methods in both constrained and unconstrained MORL, as measured by multi-party hypervolume and sparsity metrics, while supporting preference-aware policy selection across stakeholders.
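The ε-negotiation idea in the abstract can be illustrated with a minimal sketch: each party accepts solutions within a threshold of its own best value, and the thresholds are relaxed until the jointly acceptable ("negotiated") set is large enough. All function names, the two-objective encoding, and the uniform relaxation rule below are illustrative assumptions, not the authors' actual MPPN/NSGA-II implementation.

```python
# Illustrative sketch of epsilon-negotiation between two parties.
# Assumption: each solution is a pair (efficiency, safety_violation),
# with efficiency maximized and safety_violation minimized.

def negotiated_set(population, eps_eff, eps_safe):
    """Return solutions acceptable to both the efficiency and safety party."""
    best_eff = max(s[0] for s in population)   # efficiency party's ideal
    best_safe = min(s[1] for s in population)  # safety party's ideal
    return [
        s for s in population
        if s[0] >= best_eff - eps_eff        # within efficiency tolerance
        and s[1] <= best_safe + eps_safe     # within safety tolerance
    ]

def relax_until(population, min_size, step=0.1):
    """Widen both thresholds until the negotiated set is large enough.

    A stand-in for the paper's dynamic adjustment of
    (eps_efficiency, eps_safety) during evolutionary search.
    """
    eps_eff = eps_safe = 0.0
    while len(negotiated_set(population, eps_eff, eps_safe)) < min_size:
        eps_eff += step
        eps_safe += step
    return eps_eff, eps_safe

# Toy population: efficient-but-unsafe through safe-but-inefficient.
pop = [(1.0, 0.8), (0.9, 0.4), (0.6, 0.1), (0.3, 0.0)]
eps = relax_until(pop, min_size=2)
consensus = negotiated_set(pop, *eps)
```

In this toy run the thresholds relax until the two middle solutions, which compromise between the parties' ideals, enter the negotiated set; in the paper this filtering would instead guide NSGA-II's population toward the multi-party Pareto set.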
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15968