A Matrix Approximation View of NCE that Justifies Self-Normalization


Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission readers: everyone Show Bibtex
  • Abstract: Self-normalizing discriminative models approximate the normalized probability of a class without having to compute the partition function. This property is useful to computationally-intensive neural network classifiers, as the cost of computing the partition function grows linearly with the number of classes and may become prohibitive. In particular, since neural language models may deal with up to millions of classes, their self-normalization properties received notable attention. Several recent studies empirically found that language models, trained using Noise Contrastive Estimation (NCE), exhibit self-normalization, but could not explain why. In this study, we provide a theoretical justification to this property by viewing NCE as a low-rank matrix approximation. Our empirical investigation compares NCE to the alternative explicit approach for self-normalizing language models. It also uncovers a surprising negative correlation between self-normalization and perplexity, as well as some regularity in the observed errors that may potentially be used for improving self-normalization algorithms in the future.
  • TL;DR: We prove that NCE is self-normalized and demonstrate it on datasets
  • Keywords: language modeling, NCE, self-normalization