The Cramer Distance as a Solution to Biased Wasserstein Gradients


Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission readers: everyone Show Bibtex
  • Abstract: The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling, and most recently in reinforcement learning. In this paper we describe three natural properties of probability divergences that we believe reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramér distance. We show that the Cramér distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. We give empirical results on a number of domains comparing these three divergences. To illustrate the practical relevance of the Cramér distance we design a new algorithm, the Cramér Generative Adversarial Network (GAN), and show that it has a number of desirable properties over the related Wasserstein GAN.
  • TL;DR: The Wasserstein distance is hard to minimize with stochastic gradient descent, while the Cramer distance can be optimized easily and works just as well.
  • Keywords: Probability metrics, Wasserstein metric, stochastic gradient descent, GANs