Multi-objective Linear Reinforcement Learning with Lexicographic Rewards

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We provide a theoretical analysis of regret bounds for multi-objective reinforcement learning.
Abstract: Reinforcement Learning (RL) with linear transition kernels and reward functions has recently attracted growing attention due to its computational efficiency and theoretical advancements. However, prior theoretical research in RL has primarily focused on single-objective problems, resulting in limited theoretical development for multi-objective reinforcement learning (MORL). To bridge this gap, we examine MORL under lexicographic reward structures, where rewards comprise $m$ hierarchically ordered objectives. In this framework, the agent maximizes objectives sequentially, prioritizing the highest-priority objective before considering subsequent ones. We introduce the first MORL algorithm with provable regret guarantees. For any objective $i \in \{1, 2, \ldots, m\}$, our algorithm achieves a regret bound of $\widetilde{O}(\Lambda^i(\lambda) \cdot \sqrt{d^2H^4 K})$, where $\Lambda^i(\lambda) = 1 + \lambda + \cdots + \lambda^{i-1}$, $\lambda$ quantifies the trade-off between conflicting objectives, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. Furthermore, our algorithm can be applied in the misspecified setting, where the regret bound for the $i$-th objective becomes $\widetilde{O}(\Lambda^i(\lambda)\cdot(\sqrt{d^2H^4K}+\epsilon dH^2K))$, with $\epsilon$ denoting the degree of misspecification.
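For intuition on how the stated bounds scale, the following is a minimal sketch (not from the paper) that evaluates the order of the regret bound for each objective; the function names and the example parameter values are illustrative assumptions, and logarithmic factors hidden by $\widetilde{O}(\cdot)$ are dropped.

```python
import math

def lam_factor(i: int, lam: float) -> float:
    """Lambda^i(lambda) = 1 + lambda + ... + lambda^(i-1): the geometric factor
    that captures how regret compounds down the lexicographic priority order."""
    return sum(lam ** j for j in range(i))

def regret_bound(i: int, lam: float, d: int, H: int, K: int, eps: float = 0.0) -> float:
    """Order of the stated bound for objective i (log factors omitted):
    Lambda^i(lambda) * (sqrt(d^2 * H^4 * K) + eps * d * H^2 * K),
    where eps = 0 recovers the well-specified case."""
    return lam_factor(i, lam) * (math.sqrt(d**2 * H**4 * K) + eps * d * H**2 * K)

# Illustrative example: 3 objectives, trade-off lambda = 0.5, feature dimension d = 10,
# episode length H = 20, K = 10,000 episodes, well-specified model (eps = 0).
for i in range(1, 4):
    print(f"objective {i}: regret order ~ {regret_bound(i, lam=0.5, d=10, H=20, K=10_000):.2e}")
```

Note that when $\lambda < 1$, $\Lambda^i(\lambda)$ is bounded by $1/(1-\lambda)$, so lower-priority objectives pay at most a constant-factor penalty over the single-objective rate $\widetilde{O}(\sqrt{d^2H^4K})$.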
Lay Summary: Reinforcement learning (RL) works well when an agent optimizes for a single goal, but many real-world problems require balancing multiple, sometimes competing objectives, like maximizing efficiency while minimizing risk. While single-objective RL has strong theoretical foundations, multi-objective RL (MORL) lacks similar guarantees, making it harder to trust in practical applications. To address this, we focus on lexicographic MORL, where objectives are ranked by importance, e.g., safety first, then performance. We develop the first MORL algorithm with mathematically proven regret bounds, meaning we can quantify how well it performs compared to the best possible strategy. Even when the model of the environment is slightly misspecified, our method remains robust. This research matters because it provides a principled way to handle real-world tasks where trade-offs are unavoidable, from autonomous driving to healthcare. By guaranteeing performance while respecting priorities, our work helps build more reliable and transparent AI systems.
Link To Code: YTlmY
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Reinforcement learning, multi-objective, linear model
Submission Number: 9585