Keywords: Softmax Bottleneck, Transformer, Output Projection Matrix, Large Language Models
TL;DR: We show that randomly initialized or trained output projection matrices can produce exact probabilities for the top-$m$ tokens, even for rather large values of $m$.
Abstract: In many popular transformer architectures, an output projection matrix linearly maps lower-dimensional embeddings into a higher-dimensional space of logits. It has been shown that this leads to a softmax bottleneck that prevents the production of arbitrary probability distributions. It has been argued that this limits large language models (LLMs) in their ability to express next-token probabilities that perfectly align with the statistics of natural language. We focus on the ability of such models to produce accurate probabilities for just the top-$m$ tokens. We provide theoretical bounds showing that even a randomly initialized projection matrix can do so for rather large values of $m$, supported by empirical results on both random and trained matrices. This raises questions about whether the softmax bottleneck significantly limits the capabilities of LLMs. We also derive bounds on the maximum number of probabilities that any trained output projection matrix can specify.
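As a rough illustration of the mechanism the abstract refers to (and not the paper's construction), the sketch below assumes NumPy and fits a hidden state against a randomly initialized output projection so that the softmax logits reproduce the ratios of $m$ target token probabilities; this reduces to a linear system that is solvable whenever $m - 1 \le d$. The vocabulary size, embedding dimension, and $m$ are placeholder values, and the sketch only matches probabilities renormalized over the chosen tokens, which is weaker than the paper's bounds.

```python
# Illustrative sketch only: a randomly initialized output projection and a
# least-squares hidden state that reproduces prescribed ratios among m tokens.
import numpy as np

rng = np.random.default_rng(0)
V, d, m = 50_000, 512, 200                      # vocab size, embedding dim, targets (assumed values)
W = rng.standard_normal((V, d)) / np.sqrt(d)    # random output projection matrix

top = rng.choice(V, size=m, replace=False)      # the m tokens whose probabilities we target
p = rng.dirichlet(np.ones(m))                   # an arbitrary target distribution over them

# Matching the ratios p_i / p_0 is linear in the hidden state h:
#   (W h)_i - (W h)_0 = log p_i - log p_0,  for i = 1, ..., m-1.
A = W[top[1:]] - W[top[0]]                      # (m-1, d) constraint matrix
b = np.log(p[1:]) - np.log(p[0])
h, *_ = np.linalg.lstsq(A, b, rcond=None)       # minimum-norm solution; exact when m-1 <= d

logits = W @ h
q = np.exp(logits - logits.max())
q /= q.sum()
q_top = q[top] / q[top].sum()                   # softmax probabilities renormalized over the m tokens

print("max abs deviation from target (renormalized):", np.abs(q_top - p).max())
```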
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9636