Uncritical Tokens Are 'Critical' in Pretraining: The Implicit Regularization Effect of Next Token Prediction
Keywords: Next token prediction, Reasoning capability of transformers, Large language model, Implicit regularization
Abstract: Next Token Prediction (NTP) is the prevailing pre-training approach for large language models (LLMs), which have demonstrated remarkable reasoning capabilities. A key characteristic of NTP is that it requires predicting every token in a sequence, including tokens that are not directly relevant to the final answer or core logic and are therefore often regarded as training noise. While such "noise" from uncritical tokens is traditionally thought to impair learning by introducing irrelevant information, our research reveals a counterintuitive positive effect. To isolate this phenomenon, we contrast NTP with Critical Token Prediction (CTP), a training paradigm whose loss is computed exclusively on specific tokens, such as the final answer.
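To make the contrast concrete, the minimal PyTorch sketch below illustrates one way the two objectives can differ; the function name, tensor shapes, and masking convention are illustrative assumptions on our part, not the paper's implementation. NTP averages the cross-entropy over every position, while a CTP-style loss keeps only the positions marked as critical (e.g. the final answer).

```python
import torch
import torch.nn.functional as F


def sequence_loss(logits, targets, critical_mask=None):
    """Cross-entropy over a token sequence.

    logits:        (batch, seq_len, vocab_size) model outputs
    targets:       (batch, seq_len) next-token ids
    critical_mask: optional (batch, seq_len) bool mask; True marks the
                   positions whose loss is kept (e.g. answer tokens).
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)

    if critical_mask is None:
        # NTP: every position in the sequence contributes to the loss.
        return per_token.mean()

    # CTP: only the marked (critical) positions contribute; the per-token
    # losses on uncritical positions are discarded.
    mask = critical_mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


if __name__ == "__main__":
    batch, seq_len, vocab = 2, 8, 100
    logits = torch.randn(batch, seq_len, vocab)
    targets = torch.randint(0, vocab, (batch, seq_len))
    # Hypothetical example: treat only the last two tokens as the "answer".
    critical = torch.zeros(batch, seq_len, dtype=torch.bool)
    critical[:, -2:] = True

    print("NTP loss:", sequence_loss(logits, targets).item())
    print("CTP loss:", sequence_loss(logits, targets, critical).item())
```

Under this masking view, CTP simply drops the per-token loss terms on uncritical positions that NTP retains.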
Our findings show that NTP consistently surpasses CTP in reasoning ability. We hypothesize and substantiate through theoretical analysis that the learning objective on uncritical tokens acts as an implicit regularizer, analogous to explicit $L^2$ regularization. Further empirical analysis across various benchmark reasoning datasets confirms that NTP-trained models exhibit enhanced generalization and robustness, demonstrating greater resilience to perturbations and achieving flatter loss minima. These findings reveal that uncritical tokens are, in fact, 'critical' for developing robust reasoning during pre-training, offering valuable insights into optimizing training strategies for LLM development.
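The claimed analogy can be written schematically. In the LaTeX sketch below, $\mathcal{C}$ denotes the set of critical (e.g. answer) positions; the notation is ours, not the paper's. By definition the NTP objective decomposes into the CTP objective plus the loss terms on uncritical tokens, and the abstract argues that this extra term plays a role comparable to an explicit $L^2$ penalty $\lambda\lVert\theta\rVert_2^2$.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Schematic decomposition (notation is illustrative, not the paper's):
% the NTP loss equals the CTP loss plus the uncritical-token terms,
% which the paper argues act like the explicit L2 penalty shown last.
\begin{align*}
  \mathcal{L}_{\mathrm{CTP}}(\theta) &= -\sum_{t \in \mathcal{C}} \log p_\theta(x_t \mid x_{<t}),\\
  \mathcal{L}_{\mathrm{NTP}}(\theta) &= -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
      = \mathcal{L}_{\mathrm{CTP}}(\theta)
        \underbrace{-\sum_{t \notin \mathcal{C}} \log p_\theta(x_t \mid x_{<t})}_{\text{loss on uncritical tokens}},\\
  \text{explicit analogue:}\quad
      &\mathcal{L}_{\mathrm{CTP}}(\theta) + \lambda \lVert \theta \rVert_2^2.
\end{align*}
\end{document}
```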
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15736