Keywords: next-token prediction, entropy lower bound, SVM, word embeddings
TL;DR: Theoretical study of the implicit regularization of the next-token prediction training objective: when does the loss reach the lower bound and what is the structure of learned parameters?
Abstract: Next-token prediction (NTP) has become the go-to training paradigm for modern language models, yet its optimization principles are not well understood. To bridge this gap, we initiate a study of the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across \emph{distinct} contexts, each associated with a \emph{sparse} conditional probability distribution over a finite vocabulary of tokens, we introduce ``NTP-separability conditions'' that enable reaching the entropy lower bound. With this setup, we then focus on linear models, for which we characterize the optimization bias of gradient descent. Extending previous research on implicit bias in one-hot classification to the NTP setting highlights key differences and prompts further research into the optimization and generalization of NTP.
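As a rough illustration of the entropy lower bound mentioned in the abstract, the sketch below builds a toy corpus of distinct contexts with sparse next-token distributions, computes the empirical conditional entropy, and checks that the cross-entropy loss stays above it. Everything here is an assumption made for illustration: the corpus, the vocabulary, the helper names, and in particular `margin_logits`, which assigns free logits per context rather than using the linear model or the NTP-separability conditions studied in the paper.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of (context, next_token) pairs; not taken from the paper.
corpus = [
    ("x", "a"), ("x", "a"), ("x", "b"), ("x", "a"),
    ("y", "c"), ("y", "c"),
]
vocab = ["a", "b", "c"]


def empirical_conditionals(pairs):
    """Return (context frequencies, sparse next-token distribution per distinct context)."""
    counts = defaultdict(Counter)
    for ctx, tok in pairs:
        counts[ctx][tok] += 1
    n = len(pairs)
    freq = {ctx: sum(c.values()) / n for ctx, c in counts.items()}
    cond = {
        ctx: {tok: k / sum(c.values()) for tok, k in c.items()}
        for ctx, c in counts.items()
    }
    return freq, cond


def entropy_lower_bound(freq, cond):
    """Empirical conditional entropy: sum over contexts of freq * H(p(. | context))."""
    return -sum(
        freq[ctx] * sum(p * math.log(p) for p in dist.values())
        for ctx, dist in cond.items()
    )


def ntp_loss(freq, cond, logits):
    """Cross-entropy between the empirical conditionals and softmax(logits[context])."""
    loss = 0.0
    for ctx, dist in cond.items():
        z = logits[ctx]
        m = max(z.values())  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in z.values()))
        loss -= freq[ctx] * sum(p * (z[tok] - log_norm) for tok, p in dist.items())
    return loss


def margin_logits(cond, vocab, scale):
    """Illustrative logits: log p on the support, -scale on out-of-support tokens."""
    return {
        ctx: {tok: math.log(dist[tok]) if tok in dist else -scale for tok in vocab}
        for ctx, dist in cond.items()
    }


freq, cond = empirical_conditionals(corpus)
bound = entropy_lower_bound(freq, cond)
# Gap between the loss and the entropy bound as out-of-support logits are pushed down.
gaps = [ntp_loss(freq, cond, margin_logits(cond, vocab, s)) - bound for s in (1.0, 5.0, 20.0)]
print(f"entropy lower bound: {bound:.4f}")
print("loss minus bound:", gaps)
```

Because a softmax over finite logits assigns positive probability to every token, the gap in this sketch is strictly positive whenever some context has a sparse distribution; it shrinks toward zero only as the out-of-support logits diverge.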
Submission Number: 21