Monotone and Conservative Policy Iteration Beyond the Tabular Case

Published: 03 Feb 2026, Last Modified: 03 Feb 2026 | AISTATS 2026 Poster | CC BY 4.0
TL;DR: We propose RPI, a policy iteration method for function approximation with provably monotone, lower-bounding value estimates, and its variant CRPI, which adds per-step provable improvement guarantees.
Abstract: We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI replaces the Bellman-error–based policy evaluation with a Bellman-constrained optimization. We prove that RPI restores textbook monotonicity of value estimates and that these estimates provably lower-bound the true return. Their limit partially satisfies the unprojected Bellman equation, underscoring RPI’s alignment with RL foundations. For CRPI, we prove a performance-difference lower bound that accounts for function-approximation errors and approximate advantages, and we update policies by maximizing this bound. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI’s guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI’s guarantees for arbitrary function classes, RPI provides a principled basis for robust, next-generation RL.
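The abstract does not spell out the Bellman-constrained evaluation, but the following minimal sketch illustrates one plausible reading under a linear value class on a small synthetic MDP: any weights w with Φw ≤ T^π(Φw) componentwise certify Φw ≤ V^π by monotonicity of T^π, so evaluation can maximize a weighted sum of predicted values subject to these constraints. The linear-programming formulation and all names (Phi, P_pi, d, etc.) are illustrative assumptions, not the paper's actual RPI objective.

```python
# Illustrative sketch only: a "Bellman-constrained" policy evaluation for a
# linear value class on a random MDP. This is an assumed reading of the
# abstract, not the paper's RPI formulation.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_states, n_features, gamma = 20, 5, 0.9

# Random MDP under a fixed policy pi: transitions P_pi and rewards r_pi.
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n_states)              # rewards in [0, 1], so w = 0 is feasible

Phi = rng.standard_normal((n_states, n_features))  # value features, V_w = Phi @ w
d = np.full(n_states, 1.0 / n_states)              # state weights in the objective

# Constraint V_w <= T^pi V_w componentwise, i.e.
#   Phi w <= r_pi + gamma * P_pi @ Phi @ w,
# rewritten as (Phi - gamma * P_pi @ Phi) w <= r_pi.
A_ub = Phi - gamma * P_pi @ Phi
b_ub = r_pi

# Maximize d^T Phi w  <=>  minimize -(Phi^T d)^T w.
res = linprog(c=-(Phi.T @ d), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n_features, method="highs")
V_lower = Phi @ res.x

# Any feasible point lower-bounds the exact value V^pi of the fixed policy.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
assert np.all(V_lower <= V_pi + 1e-8), "feasible solutions must lower-bound V^pi"
print("max gap:", np.max(V_pi - V_lower))
```

Unlike minimizing a (projected) Bellman error, which gives no one-sided guarantee, every feasible point of such a constrained problem is a certified lower bound on the true return, which is the kind of property the abstract attributes to RPI's value estimates.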
Submission Number: 2048