Reinventing Policy Iteration under Time Inconsistency

Published: 25 Nov 2022, Last Modified: 28 Feb 2023, Accepted by TMLR
Abstract: Policy iteration (PI) is a fundamental policy search algorithm in the standard reinforcement learning (RL) setting, which can be shown to converge to an optimal policy by policy improvement theorems. However, standard PI relies on Bellman's Principle of Optimality, which can be violated by some specifications of objectives, known as time-inconsistent (TIC) objectives, such as non-exponentially discounted reward functions. The use of standard PI under TIC objectives has thus raised questions about the convergence of its policy improvement scheme and the optimality of its termination policy, often leading to its avoidance. In this paper, we consider an infinite-horizon TIC RL setting and formally present an alternative type of optimality drawn from game theory, namely subgame perfect equilibrium (SPE), that attempts to resolve these questions. We first analyze standard PI under the SPE type of optimality, revealing its merits and insufficiencies. Drawing on these observations, we propose backward Q-learning (bwdQ), a new algorithm in the approximate PI family that targets an SPE policy under non-exponentially discounted reward functions. Finally, with two TIC gridworld environments, we demonstrate the implications of our theoretical findings on the behavior of bwdQ and other approximate PI variants.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Mohammad_Ghavamzadeh1
Submission Number: 317
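
The time inconsistency mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: the hyperbolic discount form 1 / (1 + k * delay), the value gamma = 0.9, the two rewards, and all function names are assumptions chosen only for illustration. An agent re-evaluates the same two future rewards at each time step. Under hyperbolic (non-exponential) discounting its preference flips as the rewards draw near, while under exponential discounting it never changes.

```python
# Illustrative sketch only: the discount functions, rewards and names below
# are assumptions for this example, not the paper's setup or its bwdQ algorithm.

def hyperbolic(delay, k=1.0):
    """Hyperbolic discount factor, one common non-exponential form."""
    return 1.0 / (1.0 + k * delay)

def exponential(delay, gamma=0.9):
    """Standard exponential discount factor."""
    return gamma ** delay

# Each option is (reward, arrival_time). Hypothetical numbers.
OPTIONS = {"small_soon": (10.0, 10), "large_late": (17.0, 13)}

def preferred(discount, now, options=OPTIONS):
    """Name of the option with the highest discounted value as seen at time `now`."""
    return max(
        options,
        key=lambda name: options[name][0] * discount(options[name][1] - now),
    )

def preference_path(discount, options=OPTIONS, horizon=10):
    """Preferred option at every decision time 0..horizon."""
    return [preferred(discount, t, options) for t in range(horizon + 1)]

hyper_path = preference_path(hyperbolic)   # preference changes over time
expo_path = preference_path(exponential)   # preference stays the same

if __name__ == "__main__":
    for t, (h, e) in enumerate(zip(hyper_path, expo_path)):
        print(f"t={t:2d}  hyperbolic: {h:10s}  exponential: {e}")
```

With exponential discounting the ratio of the two discounted values is the same at every decision time, so the plan made at time 0 is still preferred later. With the hyperbolic form that ratio depends on the current time, which is the kind of preference reversal that breaks Bellman's Principle of Optimality and motivates an equilibrium notion such as SPE.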