Unbiased scalable softmax optimization


Nov 03, 2017 (modified: Dec 14, 2017) ICLR 2018 Conference Blind Submission readers: everyone
  • Abstract: Recent state-of-the-art neural network and language models have begun to rely on softmax distributions with an extremely large number of categories. In this setting, calculating the softmax normalizing constant is prohibitively expensive, which has spurred a growing literature on efficiently computable but biased estimates of the softmax. In this paper we present the first two unbiased algorithms for optimizing the softmax whose work per iteration is independent of the number of classes and datapoints (and which do not require extra work at the end of each epoch). We compare their empirical performance to the state of the art on seven real-world datasets, with our Implicit SGD algorithm comprehensively outperforming all competitors.
  • TL;DR: We propose the first methods for exactly optimizing the softmax distribution using stochastic gradients, with per-iteration runtime independent of the number of classes or datapoints.
  • Keywords: softmax, optimization, implicit sgd
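
The cost and bias issue mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the uniform class subsampling scheme, and the use of NumPy are all assumptions made for illustration. The exact log normalizing constant touches all K classes, while a subsampled estimate is cheap but, because the logarithm is concave, underestimates the true value on average (Jensen's inequality).

```python
import numpy as np


def exact_log_normalizer(scores):
    """Exact log(sum_k exp(scores[k])); cost is O(K) in the number of classes."""
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())


def sampled_log_normalizer(scores, n_samples, rng):
    """Hypothetical cheap estimate using n_samples classes drawn uniformly
    with replacement. (K / n_samples) * sum(exp(sampled scores)) is an
    unbiased estimate of the normalizer itself, but its logarithm is biased.
    """
    num_classes = scores.shape[0]
    idx = rng.integers(0, num_classes, size=n_samples)
    sampled = scores[idx]
    m = sampled.max()  # shift for numerical stability, using sampled classes only
    return m + np.log(num_classes / n_samples) + np.log(np.exp(sampled - m).sum())
```

Averaging `sampled_log_normalizer` over many independent draws and comparing against `exact_log_normalizer` shows the gap that motivates unbiased alternatives; the gap shrinks as `n_samples` grows, at the price of more work per iteration.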