Implicit Optimization Bias of Next-token Prediction in Linear Models

Published: 18 Jun 2024, Last Modified: 03 Jul 2024. TF2M 2024 Poster. License: CC BY 4.0
Keywords: next-token prediction, optimization, entropy, implicit regularization, max-margin
Abstract: Next-token prediction (NTP) has become the go-to training paradigm for modern language models, yet its optimization principles are not well understood. To bridge this gap, we initiate a study of the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across \emph{distinct} contexts, each paired with a \emph{sparse} conditional probability distribution over a finite vocabulary of tokens, we introduce ``NTP-separability conditions'' under which the entropy lower bound is attainable. With this setup, we then focus on linear models, for which we characterize the optimization bias of gradient descent. Extending previous research on implicit bias in one-hot classification to the NTP setting highlights key differences and prompts further research into the optimization and generalization of NTP.
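To make the abstract's framing concrete, the following is a minimal sketch, not taken from the paper: NTP over a linear model is written as cross-entropy minimization across a few distinct contexts, each paired with a sparse conditional distribution over a small vocabulary. The loss of any model is bounded below by the mean Shannon entropy of those conditionals, and plain gradient descent drives the loss toward that bound. All dimensions, the random context embeddings `H`, and the distributions `P` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, m = 8, 5, 3            # embedding dim, vocab size, number of distinct contexts

H = rng.normal(size=(m, d))  # one (hypothetical) embedding per distinct context
# Sparse conditional next-token distributions: each context supports
# only a subset of the vocabulary.
P = np.array([
    [0.7, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

def log_softmax(L):
    L = L - L.max(axis=1, keepdims=True)           # numerical stability
    return L - np.log(np.exp(L).sum(axis=1, keepdims=True))

def ntp_loss(W):
    """Mean cross-entropy of the linear model's softmax against P."""
    return -(P * log_softmax(H @ W.T)).sum(axis=1).mean()

def entropy_lower_bound():
    """Mean Shannon entropy of the conditionals: no model can do better."""
    logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
    return -(P * logP).sum(axis=1).mean()

# Gradient descent on the NTP objective; for softmax cross-entropy the
# gradient w.r.t. the logits is (softmax - P) / m.
W = np.zeros((V, d))
for _ in range(2000):
    S = np.exp(log_softmax(H @ W.T))
    W -= 0.5 * ((S - P).T @ H) / m

init_loss = ntp_loss(np.zeros((V, d)))             # = log V at W = 0
final_loss = ntp_loss(W)
bound = entropy_lower_bound()
```

With random embeddings in general position, the NTP-separability conditions generically hold for this toy instance, so the loss decreases from its uniform-prediction value `log V` toward the entropy lower bound while `W` grows without converging, which is the sense in which the optimizer's implicit bias selects a particular direction among the minimizers.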
Submission Number: 70