Design Principles for TD-based Multi-Policy MORL in Infinite Horizons

ICLR 2026 Conference Submission 18331 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Multi-Objective Reinforcement Learning, Multi-Policy Learning, Temporal-Difference Learning
Abstract: Multi-Objective Reinforcement Learning (MORL) addresses problems with multiple, often conflicting goals by seeking a set of trade-off policies rather than a single solution. Existing approaches that learn many policies at once have shown promise in deep settings, but they depend on supervised retraining and carefully curated data, making them ill-suited for online and infinite-horizon tasks. Temporal-Difference (TD) methods offer a natural alternative, as they update policies incrementally during interaction, but current TD-based approaches are limited to small, episodic problems. In this work, we present design principles for extending TD-based multi-policy MORL to infinite horizons, realized in a framework that combines trajectory-based policy tracking, mechanisms for learning both predictable (stationary) and flexible (non-stationary) policies, techniques to avoid spurious dominance relations, and cycle detection to ensure well-defined long-term behavior. Through ablation studies, we show how each principle contributes to recovering diverse and reliable policies, providing a principled path toward scalable TD-based multi-policy methods in deep MORL.
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18331