A Unified Objective for On-Policy Reinforcement Learning in Stationary and Non-Stationary Environments

ICLR 2026 Conference Submission16629 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Reinforcement Learning, Discounted Return, Average Return, On-policy, Stationary State Distribution

Abstract: A fundamental dichotomy between the discounted return and the average return has long existed in deep reinforcement learning (DRL). Algorithms based on the average return assume the existence of a stationary state distribution and often struggle in non-stationary or episodic settings. In contrast, algorithms that optimize the discounted return are well suited to non-stationary tasks but may learn suboptimal policies in long-term stationary settings because of the bias introduced by the discount factor. This forces practitioners to select an objective based on the specific environment, which limits the development of general and robust DRL algorithms. We introduce the \textbf{$k$-sliding-window return}, a novel objective that bridges these two criteria, and instantiate it in a practical on-policy algorithm, $k$-sliding-window PPO ($k$SW-PPO). We also provide a theoretical analysis showing that the loss of our objective converges to that of the average return while maintaining a bounded bias relative to the discounted return. We validate these claims through experiments on a suite of MuJoCo continuous control tasks. The results demonstrate that $k$SW-PPO achieves performance competitive with average-return PPO in stationary environments while matching the performance of its discounted-return counterpart in non-stationary settings. Our results establish the $k$-sliding-window return as a unified objective that eliminates the need for an a priori choice between discounting and averaging, and we hope it will inspire the development of more robust and general-purpose DRL algorithms.

Primary Area: reinforcement learning

Submission Number: 16629
