TL;DR: This work proposes the first performative reinforcement learning algorithm that converges to the desired performatively optimal policy with polynomial computational complexity.
Abstract: Performative reinforcement learning is an emerging dynamic decision-making framework that extends reinforcement learning to the common setting where the agent's policy can change the environmental dynamics. Existing works on performative reinforcement learning aim only at a performatively stable (PS) policy that maximizes an approximate value function. However, there can be a positive constant gap between the PS policy and the desired performatively optimal (PO) policy that maximizes the original value function. In contrast, this work proposes a zeroth-order performative policy gradient (0-PPG) algorithm that **for the first time converges to the desired PO policy with polynomial computational complexity under mild conditions**. For the convergence analysis, we prove two important properties of the nonconvex value function. First, when the policy regularizer dominates the environmental shift, the value function satisfies a certain gradient dominance property, so that any stationary point of the value function is a desired PO policy. Second, although the value function has an unbounded gradient, we prove that all sufficiently stationary points lie in a convex and compact policy subspace $\Pi_{\Delta}$, in which the policy values are bounded below by a constant $\Delta>0$, so that the gradient becomes bounded and Lipschitz continuous.
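To make the zeroth-order idea concrete, the sketch below illustrates a single two-point zeroth-order policy gradient step followed by projection onto a $\Delta$-truncated policy set. This is only an illustrative sketch under assumed choices, not the paper's 0-PPG algorithm: the names `evaluate_value` and `project_to_pi_delta`, the step size, and the smoothing radius are hypothetical placeholders.

```python
import numpy as np


def simplex_projection(v, z):
    """Euclidean projection of v onto {x >= 0, sum(x) = z} (standard sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)


def project_to_pi_delta(pi, delta):
    """Project each row of a tabular policy onto the simplex with all entries >= delta.

    Assumes delta <= 1 / n_actions so that the truncated simplex is nonempty.
    """
    n_states, n_actions = pi.shape
    out = np.empty_like(pi)
    for s in range(n_states):
        # Distribute the remaining mass (1 - n_actions * delta) on a shifted simplex.
        out[s] = simplex_projection(pi[s] - delta, 1.0 - n_actions * delta) + delta
    return out


def zeroth_order_step(pi, evaluate_value, delta, lr=0.1, smoothing=1e-2, rng=None):
    """One two-point zeroth-order ascent step on the performative value.

    `evaluate_value(pi)` is assumed to return a (possibly noisy) estimate of the
    value of deploying pi in the environment that pi itself induces.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(pi.shape)
    u /= np.linalg.norm(u)  # random unit perturbation direction
    v_plus = evaluate_value(pi + smoothing * u)
    v_minus = evaluate_value(pi - smoothing * u)
    # Two-point estimator of the gradient, scaled by the parameter dimension.
    grad_est = pi.size * (v_plus - v_minus) / (2.0 * smoothing) * u
    return project_to_pi_delta(pi + lr * grad_est, delta)
```

The projection step reflects the abstract's observation that sufficiently stationary points lie in $\Pi_{\Delta}$, where the gradient is bounded and Lipschitz continuous; restricting iterates to that set is one natural (assumed) way such a property could be exploited.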
Primary Area: Reinforcement Learning
Keywords: performative reinforcement learning, performatively optimal
Submission Number: 13617