Abstract: Recent neural network and language models have begun to rely on softmax distributions with an extremely large number of categories. In this context calculating the softmax normalizing constant is prohibitively expensive. This has spurred a growing literature of efficiently computable but biased estimates of the softmax. In this paper we present the first two unbiased algorithms for maximizing the softmax likelihood whose work per iteration is independent of the number of classes and datapoints (and does not require extra work at the end of each epoch). We compare our unbiased methods' empirical performance to the state-of-the-art on seven real world datasets, where they comprehensively outperform all competitors.
TL;DR: Propose first methods for exactly optimizing the softmax distribution using stochastic gradient with runtime independent on the number of classes or datapoints.
Keywords: softmax, optimization, implicit sgd
8 Replies
Loading