Abstract: Multi-objective reinforcement learning (MORL) has shown great promise in many real-world applications. Existing MORL algorithms often aim to learn a policy that optimizes the individual objective functions simultaneously under a given prior preference (or weights) over the objectives. However, these methods often suffer from gradient conflict, where objectives with larger gradients dominate the update direction, degrading performance on the remaining objectives. In this paper, we develop a novel dynamic-weighting multi-objective actor-critic algorithm (MOAC) with two options for the objective-weight update sub-procedure, named conflict-avoidant (CA) and faster convergence (FC). MOAC-CA aims to find a CA update direction that maximizes the minimum value improvement among the objectives, while MOAC-FC targets a much faster convergence rate. We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MOAC-CA can find an $\epsilon+\epsilon_{\text{app}}$-accurate Pareto-stationary policy using $\mathcal{O}(\epsilon^{-5})$ samples, while ensuring a small $\epsilon+\sqrt{\epsilon_{\text{app}}}$-level CA distance (defined as the distance to the CA direction), where $\epsilon_{\text{app}}$ is the function approximation error. The analysis also shows that MOAC-FC improves the sample complexity to $\mathcal{O}(\epsilon^{-3})$, but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MORL methods with fixed preferences.
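The abstract describes the conflict-avoidant (CA) direction only at a high level. As a rough illustration of the underlying idea, and not the paper's MOAC algorithm itself, the sketch below computes a CA-style direction by solving the dual min-norm problem over the preference simplex, assuming per-objective policy-gradient estimates are already available; the function names (`ca_direction`, `project_to_simplex`) and the projected-gradient solver are illustrative assumptions.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex {w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def ca_direction(grads, n_iters=200):
    """Illustrative conflict-avoidant direction from per-objective gradients.

    grads: (m, p) array, one policy-gradient estimate per objective.
    The direction that maximizes the minimum value improvement (up to a norm
    penalty) is d = sum_i w_i g_i, where w solves the dual min-norm problem
    min_{w in simplex} ||sum_i w_i g_i||^2. Here we solve the dual with
    projected gradient descent; a dynamic-weighting method like MOAC would
    instead update w incrementally from stochastic estimates.
    """
    m = grads.shape[0]
    G = grads @ grads.T                          # Gram matrix of the gradients
    lr = 1.0 / (np.linalg.norm(G, 2) + 1e-12)    # step size <= 1/L for stability
    w = np.full(m, 1.0 / m)                      # start from uniform weights
    for _ in range(n_iters):
        w = project_to_simplex(w - lr * (G @ w))  # descend on 0.5 * w^T G w
    return w @ grads                             # CA update direction
```

For example, `ca_direction(np.stack([g1, g2]))` returns a weighted combination of the two gradients whose weights balance the improvement across both objectives, rather than letting the larger gradient dominate.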