Abstract: Self-normalizing discriminative models approximate the normalized probability of a class without having to compute the partition function. This property is useful for computationally intensive neural network classifiers, as the cost of computing the partition function grows linearly with the number of classes and may become prohibitive. In particular, since neural language models may deal with up to millions of classes, their self-normalization properties have received notable attention. Several recent studies empirically found that language models trained using Noise Contrastive Estimation (NCE) exhibit self-normalization, but could not explain why. In this study, we provide a theoretical justification for this property by viewing NCE as a low-rank matrix approximation. Our empirical investigation compares NCE to the alternative explicit approach to self-normalizing language models. It also uncovers a surprising negative correlation between self-normalization and perplexity, as well as some regularity in the observed errors that may be exploited to improve self-normalization algorithms in the future.
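As a rough illustration (not from the paper), the following Python sketch contrasts exact softmax normalization, whose cost is linear in the vocabulary size, with the self-normalizing shortcut of treating a raw logit as a log-probability. The vocabulary size, the random stand-in logits, and the queried class index are all illustrative assumptions; the quantity |log Z| is the standard measure of how far a model is from being self-normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 100_000                         # vocabulary size (illustrative)
logits = 0.1 * rng.normal(size=V)   # stand-in for a model's output scores

# Exact log-probability: requires the partition function Z = sum_w exp(s_w),
# an O(V) reduction that dominates the cost for large vocabularies.
log_Z = np.logaddexp.reduce(logits)
log_p_exact = logits[42] - log_Z    # class index 42 is arbitrary

# Self-normalizing shortcut: assume log Z ~= 0 and use the raw logit directly,
# skipping the O(V) normalization entirely.
log_p_selfnorm = logits[42]

# |log Z| quantifies the self-normalization error; NCE-trained models are
# empirically observed to keep this close to zero.
print(f"log Z = {log_Z:.4f}")
print(f"exact log p = {log_p_exact:.4f}, self-normalized log p = {log_p_selfnorm:.4f}")
```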
TL;DR: We prove that NCE-trained language models are self-normalized and demonstrate this property empirically on datasets.
Keywords: language modeling, NCE, self-normalization