Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Zhilin Yang; Zihang Dai; Ruslan Salakhutdinov; William W. Cohen

15 Feb 2018 (modified: 22 Jun 2025) | ICLR 2018 Conference Blind Submission | Readers: Everyone

Abstract: We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.
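The rank argument behind the abstract can be checked numerically. The sketch below is an illustration, not the authors' code: it builds the matrix of log-probabilities over (context, word) pairs for a single softmax with random `dim`-dimensional vectors and measures its rank, which cannot exceed `dim + 1`. For contrast, it does the same for a mixture of several softmaxes that share the word embeddings, the idea the linked `mos` repository is named after. All function names, sizes, and the use of random vectors in place of a trained RNN are assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def softmax_log_prob_rank(num_contexts=200, vocab_size=100, dim=8, seed=0):
    """Numerical rank of the log-probability matrix of a single softmax."""
    rng = np.random.default_rng(seed)
    # Random stand-ins (assumption): a real model would produce these vectors.
    contexts = rng.standard_normal((num_contexts, dim))
    embeddings = rng.standard_normal((vocab_size, dim))
    # log P = H W^T minus a per-row log-partition term, so rank <= dim + 1.
    log_probs = log_softmax(contexts @ embeddings.T)
    return int(np.linalg.matrix_rank(log_probs))


def mixture_log_prob_rank(num_contexts=200, vocab_size=100, dim=8,
                          num_components=4, seed=0):
    """Numerical rank of the log-probability matrix of a mixture of softmaxes."""
    rng = np.random.default_rng(seed)
    embeddings = rng.standard_normal((vocab_size, dim))
    # Per-context mixture weights; each row sums to 1.
    weights = np.exp(log_softmax(rng.standard_normal((num_contexts, num_components))))
    probs = np.zeros((num_contexts, vocab_size))
    for k in range(num_components):
        # Each component has its own (random, hypothetical) context vectors.
        contexts_k = rng.standard_normal((num_contexts, dim))
        probs += weights[:, k:k + 1] * np.exp(log_softmax(contexts_k @ embeddings.T))
    # The log of a weighted sum of softmaxes is no longer a low-rank matrix
    # plus a per-row shift, so the dim + 1 bound does not apply.
    return int(np.linalg.matrix_rank(np.log(probs)))
```

With the defaults, `softmax_log_prob_rank()` should return at most 9 (that is, `dim + 1`), while `mixture_log_prob_rank()` is expected to come out well above that bound even though both use the same embedding size.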

Code: [zihangdai/mos](https://github.com/zihangdai/mos) + [8 community implementations](https://paperswithcode.com/paper/?openreview=HkwZSG-CZ)

Data: [Penn Treebank](https://paperswithcode.com/dataset/penn-treebank), [WikiText-2](https://paperswithcode.com/dataset/wikitext-2)

Community Implementations: [9 code implementations](https://www.catalyzex.com/paper/breaking-the-softmax-bottleneck-a-high-rank/code)
