Keywords: policy gradient methods, reinforcement learning theory, f-divergence, Tsallis entropy, Shannon entropy
TL;DR: We propose an $f$-divergence–regularized policy gradient method with coupled parameterization, providing explicit global last-iterate convergence rates in the stochastic setting.
Abstract: We introduce $\texttt{f-PG}$, a new class of stochastic policy gradient methods regularized by a family of $f$-divergences, including entropy and Tsallis divergences. For each divergence, we employ a $\textit{coupled}$ parameterization, defined by $f$-softargmax, which allows us to establish the first explicit, non-asymptotic, last-iterate convergence rates for stochastic policy gradient methods.
To derive these rates, we prove that the $f$-regularized value function is smooth and satisfies a Polyak-Łojasiewicz inequality as a function of the $f$-softargmax parameters. To establish the latter, we introduce a general policy improvement operator that restricts optimization to a well-defined policy space excluding ill-behaved policies. In the softmax case, this allows us to escape the "gravitational pull" and yields the first $\textit{explicit}$ convergence guarantees for this parameterization, closing a gap in the literature.
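As a schematic illustration (using standard notation from the entropy-regularization literature rather than the paper's own; the reference policy $\mu$, regularization weight $\lambda$, initial distribution $\rho$, and constant $c_\lambda$ are placeholders), the $f$-regularized objective and the PL-type inequality in question take the form
\[
  V_\lambda^{\pi}(\rho)
  \;=\;
  \mathbb{E}_{\pi}\!\left[
    \sum_{t\ge 0}\gamma^{t}\Big(
      r(s_t,a_t)\;-\;\lambda\, D_f\!\big(\pi(\cdot\mid s_t)\,\big\|\,\mu(\cdot\mid s_t)\big)
    \Big)\,\middle|\, s_0\sim\rho
  \right],
\]
\[
  \big\|\nabla_\theta V_\lambda^{\pi_\theta}(\rho)\big\|^2
  \;\ge\;
  c_\lambda\,\big(V_\lambda^{*}(\rho)-V_\lambda^{\pi_\theta}(\rho)\big)
  \qquad\text{(PL inequality in the $f$-softargmax parameters $\theta$),}
\]
where the Shannon-entropy and Tsallis cases correspond to particular choices of $f$.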
Finally, we leverage these rates to derive sample complexity bounds for the unregularized problem and show that $\texttt{f-PG}$ with Tsallis divergences provides a provably better trade-off between sample complexity and regularization bias than softmax-based policy gradient with entropy regularization.
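For concreteness, below is a minimal sketch of the Shannon-entropy special case: entropy-regularized softmax policy gradient on a one-state bandit. This is not the paper's $\texttt{f-PG}$ algorithm; the hyperparameters (`lam`, `lr`, `n_steps`) and the bandit rewards are illustrative placeholders.

```python
import numpy as np

# Minimal sketch, assuming a one-state bandit: the Shannon-entropy special case
# of f-divergence-regularized policy gradient with a softmax (coupled)
# parameterization. `lam`, `lr`, and `true_rewards` are illustrative placeholders.

rng = np.random.default_rng(0)
n_actions, lam, lr, n_steps = 4, 0.1, 0.5, 2000
true_rewards = np.array([1.0, 0.5, 0.2, 0.0])

def softmax(z):
    z = z - z.max()          # numerical stability
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros(n_actions)  # policy parameters (tabular softmax)

for _ in range(n_steps):
    pi = softmax(theta)
    a = rng.choice(n_actions, p=pi)                     # sample an action
    r = true_rewards[a] + 0.1 * rng.standard_normal()   # noisy reward

    # Unbiased stochastic gradient of J(theta) = E_pi[r] + lam * H(pi):
    # (r - lam * (log pi(a) + 1)) * grad_theta log pi(a); the "+1" term has
    # zero mean under pi, so it does not bias the estimate.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += lr * (r - lam * (np.log(pi[a]) + 1.0)) * grad_log_pi

print("learned policy:", np.round(softmax(theta), 3))
```

Replacing the Shannon-entropy term with a Tsallis divergence changes the coupled parameterization (the $f$-softargmax) and, per the abstract, the resulting sample complexity/bias trade-off.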
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18709