Keywords: policy gradient methods, reinforcement learning theory, f-divergence, Tsallis entropy, Shannon entropy
TL;DR: We propose an $f$-divergence–regularized policy gradient method with coupled parameterization, providing explicit global last-iterate convergence rates in the stochastic setting.
Abstract: We introduce $\texttt{f-PG}$, a new class of stochastic policy gradient methods regularized by a family of $f$-divergences, including entropy and Tsallis divergences. For each divergence, we employ a $\textit{coupled}$ parameterization, defined by $f$-softargmax, which allows us to establish the first explicit, non-asymptotic, last-iterate convergence rates for stochastic policy gradient methods.
To derive these rates, we prove that the $f$-regularized value function is smooth and satisfies a Polyak-Łojasiewicz inequality as a function of the $f$-softargmax parameters. To establish the latter, we introduce a general policy improvement operator that restricts optimization to a well-defined policy space excluding ill-behaved policies. In the softmax case, this allows us to escape the "gravitational pull" and yields the first $\textit{explicit}$ convergence guarantees for this parameterization, closing a gap in the literature.
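As a schematic illustration (using standard notation from the entropy-regularization literature rather than the paper's own; the reference policy $\mu$, regularization weight $\lambda$, initial distribution $\rho$, and constant $c_\lambda$ are placeholders), the $f$-regularized objective and the PL-type inequality in question take the form
\[
  V_\lambda^{\pi}(\rho)
  \;=\;
  \mathbb{E}_{\pi}\!\left[
    \sum_{t\ge 0}\gamma^{t}\Big(
      r(s_t,a_t)\;-\;\lambda\, D_f\!\big(\pi(\cdot\mid s_t)\,\big\|\,\mu(\cdot\mid s_t)\big)
    \Big)\,\middle|\, s_0\sim\rho
  \right],
\]
\[
  \big\|\nabla_\theta V_\lambda^{\pi_\theta}(\rho)\big\|^2
  \;\ge\;
  c_\lambda\,\big(V_\lambda^{*}(\rho)-V_\lambda^{\pi_\theta}(\rho)\big)
  \qquad\text{(PL inequality in the $f$-softargmax parameters $\theta$),}
\]
where the Shannon-entropy and Tsallis cases correspond to particular choices of $f$.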
Finally, we leverage these rates to derive sample complexity bounds for the unregularized problem and show that $\texttt{f-PG}$ with Tsallis divergences provides a provably better trade-off between sample complexity and regularization bias than softmax-based policy gradient with entropy regularization.
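For concreteness, below is a minimal sketch of the Shannon-entropy special case: entropy-regularized softmax policy gradient on a one-state bandit. This is not the paper's $\texttt{f-PG}$ algorithm; the hyperparameters (`lam`, `lr`, `n_steps`) and the bandit rewards are illustrative placeholders.

```python
import numpy as np

# Minimal sketch, assuming a one-state bandit: the Shannon-entropy special case
# of f-divergence-regularized policy gradient with a softmax (coupled)
# parameterization. `lam`, `lr`, and `true_rewards` are illustrative placeholders.

rng = np.random.default_rng(0)
n_actions, lam, lr, n_steps = 4, 0.1, 0.5, 2000
true_rewards = np.array([1.0, 0.5, 0.2, 0.0])

def softmax(z):
    z = z - z.max()          # numerical stability
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros(n_actions)  # policy parameters (tabular softmax)

for _ in range(n_steps):
    pi = softmax(theta)
    a = rng.choice(n_actions, p=pi)                     # sample an action
    r = true_rewards[a] + 0.1 * rng.standard_normal()   # noisy reward

    # Unbiased stochastic gradient of J(theta) = E_pi[r] + lam * H(pi):
    # (r - lam * (log pi(a) + 1)) * grad_theta log pi(a); the "+1" term has
    # zero mean under pi, so it does not bias the estimate.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += lr * (r - lam * (np.log(pi[a]) + 1.0)) * grad_log_pi

print("learned policy:", np.round(softmax(theta), 3))
```

Replacing the Shannon-entropy term with a Tsallis divergence changes the coupled parameterization (the $f$-softargmax) and, per the abstract, the resulting sample complexity/bias trade-off.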
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18709