Improving generalization with Wasserstein regularization


Nov 07, 2017 (modified: Nov 07, 2017) · ICLR 2018 Conference Blind Submission
  • Abstract: Natural gradients are expensive to calculate but have been shown to improve both convergence speed and generalization. We point out that the natural gradient is the optimal update when one regularizes the Kullback-Leibler divergence between the output distributions of successive updates. The natural gradient can thus be seen as a regularization term on the change in the parameterized function. With this intuition, we propose that the same effect can be achieved more efficiently by choosing and regularizing a simpler metric of similarity between two distributions. The resulting algorithm, which we term Wasserstein regularization, explicitly penalizes changes in predictions on a held-out set. It can be interpreted as a form of regularization that encourages simple functions. Experiments show that Wasserstein regularization is efficient and leads to considerably better generalization.
  • TL;DR: We show that the natural gradient, which underlies many successful practices, is a regularization term upon changes in the function outputs. Wasserstein regularization is cheaper and easier to compute but performs the same task.
  • Keywords: natural gradient, generalization, optimization
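The abstract does not specify the exact form of the penalty, but the core idea it describes (explicitly penalizing the change in held-out predictions between successive updates) can be sketched as follows. This is a minimal illustration, assuming class probabilities over ordered bins so that the 1-D Wasserstein-1 distance reduces to a sum of absolute CDF differences; the prediction arrays, the regularization strength `lam`, and the stand-in task loss are all hypothetical values, not taken from the paper.

```python
import numpy as np

def wasserstein_1d(p, q):
    # Wasserstein-1 distance between two discrete distributions over
    # ordered, unit-spaced bins: the sum of absolute CDF differences.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

# Hypothetical held-out predictions before and after a parameter update.
preds_before = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1]])
preds_after  = np.array([[0.6, 0.3, 0.1],
                         [0.1, 0.7, 0.2]])

lam = 0.1   # regularization strength (hypothetical)
task_loss = 0.42  # stand-in for the ordinary training loss

# Penalize the average shift in held-out output distributions.
penalty = np.mean([wasserstein_1d(p, q)
                   for p, q in zip(preds_before, preds_after)])
regularized_loss = task_loss + lam * penalty
```

In this sketch, each example's prediction moved 0.1 of probability mass one bin over, so `penalty` is 0.1 and the regularizer adds `lam * 0.1` to the loss. The paper's actual choice of ground metric and held-out batching may differ; this only shows the "penalize prediction change" mechanism the abstract describes.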