Abstract: The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry: the exponential family ($\theta$ coordinates) and the mixture family ($\eta$ coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the underlying statistical model. In continuous time, we prove that the convergence rates of GD in the $\theta$ and $\eta$ coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in $\eta$ and $\theta$ coordinates can be scaled to $2c$ and $\frac{2}{c}$, respectively, for any $c>0$, while NGD maintains a fixed convergence rate of $2$, remaining invariant to such transformations and sandwiched between them.
Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.
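For concreteness, below is a minimal illustrative sketch (not the authors' code) of the setting described in the abstract: minimizing the KL divergence to a fixed categorical target, comparing Euclidean GD in the exponential-family ($\theta$) coordinates against NGD, which preconditions the gradient with the Fisher information matrix. The target distribution `p_star`, the step size, the dimension, and the direction of the KL divergence are illustrative assumptions, not choices taken from the paper.

```python
# A minimal sketch, assuming a categorical model and the "forward" KL(p* || p_theta):
# compare Euclidean gradient descent in exponential-family (theta) coordinates with
# natural gradient descent (NGD), whose preconditioner is the Fisher information matrix.
import numpy as np

def probs_from_theta(theta):
    # theta are exponential-family coordinates (log-odds relative to the last category)
    z = np.concatenate([theta, [0.0]])
    z -= z.max()                        # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def grad_theta(theta, p_star):
    # For this family, the gradient of KL(p* || p_theta) w.r.t. theta equals
    # eta(theta) - eta*, where eta are the mixture (expectation) coordinates.
    return probs_from_theta(theta)[:-1] - p_star[:-1]

def fisher(theta):
    # Fisher information in theta coordinates: diag(p) - p p^T on the first k-1 entries.
    p = probs_from_theta(theta)[:-1]
    return np.diag(p) - np.outer(p, p)

p_star = np.array([0.5, 0.3, 0.2])      # assumed target distribution
theta_gd = np.zeros(2)                  # both runs start at the uniform distribution
theta_ngd = np.zeros(2)
lr = 0.5                                # assumed step size

for _ in range(200):
    theta_gd -= lr * grad_theta(theta_gd, p_star)                # Euclidean GD in theta
    g = grad_theta(theta_ngd, p_star)
    theta_ngd -= lr * np.linalg.solve(fisher(theta_ngd), g)      # NGD step

print("GD :", kl(p_star, probs_from_theta(theta_gd)))
print("NGD:", kl(p_star, probs_from_theta(theta_ngd)))
```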
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Second revision on 13 June 2025: This revision incorporates and addresses the remaining concerns of the reviewers. In particular, a new Section 5 has been added that presents numerical studies with the empirical KL divergence in the full-batch setting as well as the mini-batch stochastic gradient descent setting. To incorporate the reviewers' comments thoroughly, the main manuscript has been extended beyond 12 pages.
First revision on 04 June 2025: This revision incorporates the changes (marked in red) requested by the reviewers. Please see the formal author responses for detailed changes. To incorporate the reviewers' comments thoroughly, the main manuscript has been extended from 12 to 15 pages.
Assigned Action Editor: ~Stephen_Becker1
Submission Number: 4787