Policy Optimization with $f$-Divergence Regularization

Dawei Zhang; Junfeng Wen

14 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Reinforcement Learning, f-divergence, policy optimization

TL;DR: We develop an iterative policy optimization algorithm using f-divergence regularization, with a monotonic improvement guarantee and competitive results in both online and offline settings.

Abstract: Policy iteration is a common algorithmic framework in reinforcement learning (RL) for finding the optimal policy of a Markov decision process (MDP). To improve training stability and prevent catastrophic failure, researchers have developed several policy iteration algorithms based on the Kullback-Leibler (KL) divergence, such as the well-known trust region policy optimization (TRPO) and proximal policy optimization (PPO). However, these methods are limited to the KL divergence, which may not be the best choice for all environments. In this work, we generalize this line of work to a broader class of divergences, the $f$-divergences, and design a new family of algorithms that improve the learned policy with theoretical improvement guarantees. Our method, $f$-divergence-regularized policy optimization ($f$RPO), can be applied to both online and offline RL settings. Empirical studies show that $f$RPO can outperform existing methods, including those based on the commonly used KL divergence, on common RL benchmark problems.
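
For readers unfamiliar with the term, the following is a minimal sketch of the standard textbook $f$-divergence between two discrete distributions, $D_f(P \| Q) = \sum_x Q(x)\, f(P(x)/Q(x))$ with $f$ convex and $f(1) = 0$. It only illustrates how the KL divergence arises as the special case $f(t) = t \log t$; it is not the $f$RPO algorithm. The abstract does not state which generators or argument order the authors use, so the function names, the chosen generators, and the example distributions below are all assumptions made for illustration.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)); assumes q(x) > 0 for all x."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl_generator(t):
    # f(t) = t log t recovers KL(p || q); uses the convention 0 log 0 = 0.
    return t * math.log(t) if t > 0 else 0.0

def chi2_generator(t):
    # f(t) = (t - 1)^2 gives the Pearson chi-squared divergence.
    return (t - 1.0) ** 2

def tv_generator(t):
    # f(t) = |t - 1| / 2 gives the total variation distance.
    return 0.5 * abs(t - 1.0)

# Hypothetical action distributions of two policies at a single state.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl = f_divergence(p, q, kl_generator)
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
chi2 = f_divergence(p, q, chi2_generator)
tv = f_divergence(p, q, tv_generator)

print(f"KL via f-divergence: {kl:.6f}, direct KL: {kl_direct:.6f}")
print(f"chi-squared: {chi2:.6f}, total variation: {tv:.6f}")
```

Swapping the generator `f` changes how strongly large probability ratios are penalized, which is the kind of flexibility a KL-only regularizer does not offer.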

Primary Area: reinforcement learning

Submission Number: 5269
