Keywords: Kullback-Leibler divergence, statistical divergence, multivariate Gaussian distribution, mathematical optimization, Lambert $W$ function, machine learning, flow-based model, reinforcement learning
TL;DR: We prove the approximate symmetry and relaxed triangle inequality of KL divergence between multivariate Gaussian distributions. We also demonstrate the applications of our theorems.
Abstract: Kullback-Leibler (KL) divergence is one of the most important measures to calculate the difference between probability distributions. In this paper, we theoretically study several properties of KL divergence between multivariate Gaussian distributions. Firstly, for any two $n$-dimensional Gaussian distributions $\mathcal{N}_1$ and $\mathcal{N}_2$, we prove that when $KL(\mathcal{N}_2||\mathcal{N}_1)\leq \varepsilon\ (\varepsilon>0)$ the supremum of $KL(\mathcal{N}_1||\mathcal{N}_2)$ is $(1/2)\left((-W_{0}(-e^{-(1+2\varepsilon)}))^{-1}+\log(-W_{0}(-e^{-(1+2\varepsilon)})) -1 \right)$, where $W_0$ is the principal branch of Lambert $W$ function. For small $\varepsilon$, the supremum is $\varepsilon + 2\varepsilon^{1.5} + O(\varepsilon^2)$. This quantifies the approximate symmetry of small KL divergence between Gaussian distributions. We further derive the infimum of $KL(\mathcal{N}_1||\mathcal{N}_2)$ when $KL(\mathcal{N}_2||\mathcal{N}_1)\geq M\ (M>0)$. We give the conditions when the supremum and infimum can be attained. Secondly, for any three $n$-dimensional Gaussian distributions $\mathcal{N}_1$, $\mathcal{N}_2$, and $\mathcal{N}_3$, we theoretically show that an upper bound of $KL(\mathcal{N}_1||\mathcal{N}_3)$ is $3\varepsilon_1+3\varepsilon_2+2\sqrt{\varepsilon_1\varepsilon_2}+o(\varepsilon_1)+o(\varepsilon_2)$ when $KL(\mathcal{N}_1||\mathcal{N}_2)\leq \varepsilon_1$ and $KL(\mathcal{N}_2||\mathcal{N}_3)\leq \varepsilon_2$ ($\varepsilon_1,\varepsilon_2\ge 0$). This reveals that KL divergence between Gaussian distributions follows a relaxed triangle inequality. Note that, all these bounds in the theorems presented in this work are independent of the dimension $n$. Finally, we discuss several applications of our theories in deep learning, reinforcement learning, and sample complexity research.
Supplementary Material: pdf
Submission Number: 5834
Loading