Implicit Optimization Bias of Next-token Prediction in Linear Models

Christos Thrampoulidis

Implicit Optimization Bias of Next-token Prediction in Linear Models

Christos Thrampoulidis

Published: 25 Sept 2024, Last Modified: 06 Nov 2024NeurIPS 2024 posterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: entropy, gradient descent, SVM, next-token prediction, linear models, separability, language models, word embeddings

TL;DR: By framing next-token prediction training as sparse soft-label classification, we characterize the implicit optimization biases of gradient descent in language models trained to achieve their entropy lower bound.

Abstract: We initiate an investigation into the optimization properties of next-token prediction (NTP), the dominant training paradigm for modern language models. Specifically, we study the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across \emph{distinct} contexts, each tied with a \emph{sparse} conditional probability distribution across a finite vocabulary of tokens, we introduce ``NTP-separability conditions'' that enable reaching the data-entropy lower bound. With this setup, and focusing on linear models with fixed context embeddings, we characterize the optimization bias of gradient descent (GD): Within the data subspace defined by the sparsity patterns of distinct contexts, GD selects parameters that equate the logits' differences of in-support tokens to their log-odds. In the orthogonal subspace, the GD parameters diverge in norm and select the direction that maximizes a margin specific to NTP. These findings extend previous research on implicit bias in one-hot classification to the NTP setting, highlighting key differences and prompting further research into the optimization and generalization properties of NTP, irrespective of the specific architecture used to generate the context embeddings.

Primary Area: Natural language processing

Submission Number: 3502

Loading