Keywords: Next-Token Prediction, Large Language Models, Position-Aware
Abstract: Next-token prediction (NTP) is the dominant training paradigm for large language models (LLMs), enabling strong autoregressive (AR) generation capabilities. Despite its success, models trained with vanilla NTP often exhibit counterintuitive failure patterns, such as the reversal curse, the factorization curse, and sensitivity to knowledge position. These failures stem from the fixed left-to-right token order imposed during teacher-forcing supervision, which entangles content and token order in ways that compromise permutation invariance. To address them, we introduce a position-aware training framework that enables AR models to predict the next token based not only on the content seen so far but also on the position of the token to be predicted. This disentanglement of what to predict from where to predict improves the robustness of LLMs to different token orderings. We instantiate the framework via two complementary approaches: (1) Content-Position Coupling (CPC), which injects a lightweight position-aware embedding into the input sequence without modifying the model architecture; and (2) Content-Position Decoupling (CPD), which introduces modular position-aware blocks into the AR model during pre-training to provide explicit supervision over target positions. Experiments across three representative tasks demonstrate that our framework consistently improves performance over strong baselines while maintaining architectural simplicity and convergence efficiency. Code is available at {\url{https://anonymous.4open.science/r/CPC-CPD}}.
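To make the CPC idea concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes CPC can be read as adding an embedding of the target (to-be-predicted) position to each input token embedding, leaving the AR model itself unchanged. The module name CPCEmbedding and the parameter names (target_pos_emb, target_positions) are hypothetical and chosen only for illustration.

# Hedged sketch of Content-Position Coupling (CPC): augment each input
# token embedding with an embedding of the position it is asked to predict,
# so the model conditions on both content and target position.
import torch
import torch.nn as nn

class CPCEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Embedding of the *target* position (where the prediction lands),
        # distinct from the usual input-position encoding.
        self.target_pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, input_ids, target_positions):
        # input_ids: (batch, seq_len) tokens already seen
        # target_positions: (batch, seq_len) index of the token to predict
        return self.tok_emb(input_ids) + self.target_pos_emb(target_positions)

# Usage: under vanilla left-to-right NTP the target position is simply i+1,
# but the same module accepts arbitrary (e.g., permuted) target orders.
emb = CPCEmbedding(vocab_size=32000, max_len=2048, d_model=768)
ids = torch.randint(0, 32000, (2, 16))
targets = torch.arange(1, 17).expand(2, -1)   # standard next-token targets
x = emb(ids, targets)                          # (2, 16, 768), fed to an unmodified AR transformer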
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25482