Enabling Pareto-Stationarity Exploration in Multi-Objective Reinforcement Learning: A Weighted-Chebyshev Multi-Objective Actor-Critic Approach

FNU Hairi; Yang Jiao; Tianchen Zhou; Haibo Yang; Chaosheng Dong; Fan Yang; Michinari Momma; Yan Gao; Jia Liu

28 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Multi-Objective Reinforcement Learning, Actor-Critic Algorithm

Abstract: In many multi-objective reinforcement learning (MORL) applications, systematically exploring the Pareto-stationary solutions under multiple non-convex reward objectives with theoretical finite-time sample complexity guarantees is an important yet under-explored problem. This motivates us to take a first step toward filling this gap in MORL. Specifically, in this paper, we propose a weighted-Chebyshev multi-objective actor-critic (WC-MOAC) algorithm for MORL, which uses multi-temporal-difference (TD) learning in the critic step and judiciously integrates the weighted-Chebyshev (WC) and multi-gradient descent techniques in the actor step to enable systematic Pareto-stationarity exploration with finite-time sample complexity guarantees. Our proposed WC-MOAC algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}p_{\min}^{-2})$ in finding an $\epsilon$-Pareto-stationary solution, where $p_{\min}$ denotes the minimum entry of a given weight vector $p$ in the WC-scalarization. This result not only implies a state-of-the-art sample complexity that is independent of the number of objectives $M$, but also gives a new dependence result in terms of the preference vector $p$. Furthermore, simulation studies on a large KuaiRand offline dataset show that our WC-MOAC algorithm significantly outperforms other baseline MORL approaches.
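As a rough illustration of the two quantities the abstract refers to, the sketch below computes a textbook weighted-Chebyshev scalarization, $\max_i p_i |f_i - z_i|$, and the leading-order term $\epsilon^{-2} p_{\min}^{-2}$ of the stated sample complexity. It is not the authors' implementation: the function names, the optional reference point `reference`, and the use of per-objective losses are assumptions made for illustration, and the exact scalarization used in the paper may differ. Constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$ are dropped.

```python
from typing import Optional, Sequence


def weighted_chebyshev(
    losses: Sequence[float],
    weights: Sequence[float],
    reference: Optional[Sequence[float]] = None,
) -> float:
    """Textbook weighted-Chebyshev scalarization: max_i p_i * |f_i - z_i|.

    `reference` (z) is a hypothetical ideal point; it defaults to all zeros.
    """
    if reference is None:
        reference = [0.0] * len(losses)
    if not (len(losses) == len(weights) == len(reference)):
        raise ValueError("losses, weights, and reference must have equal length")
    return max(p * abs(f - z) for f, p, z in zip(losses, weights, reference))


def complexity_scale(epsilon: float, weights: Sequence[float]) -> float:
    """Leading-order term eps^-2 * p_min^-2 from the abstract.

    Constants and log factors are omitted, so only ratios are meaningful.
    """
    if epsilon <= 0 or min(weights) <= 0:
        raise ValueError("epsilon and all weights must be positive")
    p_min = min(weights)
    return epsilon ** -2 * p_min ** -2


if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]   # hypothetical preference vector over M = 3 objectives
    f = [1.0, 4.0, 2.0]     # hypothetical per-objective losses
    print(weighted_chebyshev(f, p))
    print(complexity_scale(0.1, p))
```

Under this reading, the bound depends on the preference vector only through its smallest entry: halving $p_{\min}$ multiplies the leading term by four, while adding more objectives with the same $p_{\min}$ leaves it unchanged.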

Supplementary Material: pdf

Primary Area: reinforcement learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 12698
