Abstract: Self-normalizing discriminative models approximate the normalized probability of a class without having to compute the partition function. This property is useful for computationally intensive neural network classifiers, as the cost of computing the partition function grows linearly with the number of classes and may become prohibitive. In particular, since neural language models may deal with up to millions of classes, their self-normalization properties have received notable attention. Several recent studies empirically found that language models trained using Noise Contrastive Estimation (NCE) exhibit self-normalization, but could not explain why. In this study, we provide a theoretical justification for this property by viewing NCE as a low-rank matrix approximation. Our empirical investigation compares NCE to the alternative explicit approach to self-normalizing language models. It also uncovers a surprising negative correlation between self-normalization and perplexity, as well as some regularity in the observed errors that may be exploited to improve self-normalization algorithms in the future.
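As a rough illustration (not from the paper), the following Python sketch contrasts exact softmax normalization, whose cost is linear in the vocabulary size, with the self-normalizing shortcut of treating a raw logit as a log-probability. The vocabulary size, the random stand-in logits, and the queried class index are all illustrative assumptions; the quantity |log Z| is the standard measure of how far a model is from being self-normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 100_000                         # vocabulary size (illustrative)
logits = 0.1 * rng.normal(size=V)   # stand-in for a model's output scores

# Exact log-probability: requires the partition function Z = sum_w exp(s_w),
# an O(V) reduction that dominates the cost for large vocabularies.
log_Z = np.logaddexp.reduce(logits)
log_p_exact = logits[42] - log_Z    # class index 42 is arbitrary

# Self-normalizing shortcut: assume log Z ~= 0 and use the raw logit directly,
# skipping the O(V) normalization entirely.
log_p_selfnorm = logits[42]

# |log Z| quantifies the self-normalization error; NCE-trained models are
# empirically observed to keep this close to zero.
print(f"log Z = {log_Z:.4f}")
print(f"exact log p = {log_p_exact:.4f}, self-normalized log p = {log_p_selfnorm:.4f}")
```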
TL;DR: We prove that NCE-trained language models are self-normalized and demonstrate this property empirically on datasets.
Keywords: language modeling, NCE, self-normalization