Keywords: optimal transport, gradient flow, optimization, Wasserstein, Fisher-Rao, information geometry, Bayesian inference, sampling, kernel methods, diffusion, generative models
Abstract: Otto's Wasserstein gradient flow of the exclusive KL divergence functional provides a powerful and mathematically principled perspective for analyzing machine learning and Bayesian inference algorithms. In contrast, algorithms for inclusive KL inference, i.e., minimizing $\mathrm{KL}(\pi\|\mu)$ over $\mu$ for a given target $\pi$, are rarely analyzed with such tools from mathematical analysis. This paper shows that a general-purpose approximate inclusive KL inference paradigm can be constructed using the theory of gradient flows from PDE analysis. We uncover precise relationships between inclusive KL inference and several widely used learning algorithms, including MMD minimization and the Wasserstein flow of kernel discrepancies. In particular, several existing sampling algorithms, such as those based on the Wasserstein flow of kernel discrepancies, can be viewed in a unified manner as inclusive KL inference with approximate gradient estimators. Finally, we provide the theoretical foundation for Fisher-Rao-type gradient flows for minimizing the inclusive KL divergence.
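To make the connection mentioned in the abstract concrete, the sketch below is a minimal, hypothetical illustration (not the paper's algorithm) of a Wasserstein gradient flow of a kernel discrepancy: particles representing $\mu$ follow the squared-MMD flow toward samples from a target $\pi$, the kind of kernel-discrepancy flow the abstract relates to approximate inclusive KL inference. All function names, the RBF kernel choice, and the toy Gaussian target are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel_grad(x, y, bandwidth=1.0):
    """Gradient of the RBF kernel k(x, y) with respect to x."""
    diff = x - y
    k = np.exp(-np.sum(diff**2) / (2 * bandwidth**2))
    return -diff / bandwidth**2 * k

def mmd_flow_step(particles, target_samples, step_size=0.1, bandwidth=1.0):
    """One explicit Euler step of the MMD^2 Wasserstein gradient flow.

    Each particle moves along the negative gradient of the MMD witness
    function: repulsion from the other particles (first term) and
    attraction toward the target samples (second term).
    """
    n, m = len(particles), len(target_samples)
    new_particles = particles.copy()
    for i, x in enumerate(particles):
        grad = sum(rbf_kernel_grad(x, xj, bandwidth) for xj in particles) / n \
             - sum(rbf_kernel_grad(x, yj, bandwidth) for yj in target_samples) / m
        new_particles[i] = x - step_size * grad
    return new_particles

# Toy usage: push 50 particles toward samples from a shifted Gaussian target.
rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=(50, 2))
target = rng.normal(2.0, 0.5, size=(200, 2))
for _ in range(200):
    particles = mmd_flow_step(particles, target, step_size=0.5)
print("particle mean after flow:", particles.mean(axis=0))
```

In this toy setting the flow only sees $\pi$ through samples and a kernel; the abstract's claim is that such kernel-discrepancy flows can be interpreted as inclusive KL inference with approximate gradient estimators.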
Submission Number: 22