Keywords: Next-Token Prediction, Large Language Models, Position-Aware
Abstract: Next-token prediction (NTP) is the dominant training paradigm for large language models (LLMs), enabling strong autoregressive (AR) generation capabilities. Despite its success, models trained with vanilla NTP often exhibit counterintuitive failure patterns, such as the reversal curse, the factorization curse, and sensitivity to knowledge position. These failures stem from the lack of permutation invariance in LLMs, which arises from the fixed left-to-right token order imposed by teacher-forcing supervision. To address this issue, we introduce a position-aware training framework that enables AR models to learn from all possible permutations of a sequence. We first design a position-aware embedding that allows LLMs to predict the next token based not only on the preceding context but also on the position of the target token within the sequence. This embedding is integrated into LLMs through two complementary approaches: (1) Content-Position Coupling (CPC), which injects the embedding directly into the input embedding via element-wise addition, without altering the model architecture; and (2) Content-Position Decoupling (CPD), which adds modular position-aware blocks with a cross-attention mechanism on top of the AR model. In this mechanism, the position-aware embedding serves as the query, while the hidden states from the final layer of the AR model serve as the key and value. Experiments across three representative tasks demonstrate that our framework consistently improves performance over strong baselines while maintaining architectural simplicity and convergence efficiency. Code is available at https://anonymous.4open.science/r/CPC-CPD.
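The two integration strategies described in the abstract can be made concrete with a minimal PyTorch sketch. The module names (PositionAwareEmbedding, CPCWrapper, CPDBlock), the learned absolute-position table, and the pre-norm cross-attention layout are illustrative assumptions rather than the authors' implementation; the linked repository contains the actual code.

```python
# Illustrative sketch only: names, shapes, and wiring are assumptions
# inferred from the abstract, not the authors' released implementation.
import torch
import torch.nn as nn


class PositionAwareEmbedding(nn.Module):
    """Embeds the position of the token to be predicted next."""

    def __init__(self, max_positions: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(max_positions, d_model)

    def forward(self, target_positions: torch.LongTensor) -> torch.Tensor:
        # target_positions: (batch, seq_len) positions of the next token to predict
        return self.emb(target_positions)


class CPCWrapper(nn.Module):
    """Content-Position Coupling: add the position-aware embedding to the
    input embeddings via element-wise addition; the backbone is unchanged."""

    def __init__(self, backbone: nn.Module, pos_emb: PositionAwareEmbedding):
        super().__init__()
        self.backbone = backbone  # any AR transformer returning hidden states
        self.pos_emb = pos_emb

    def forward(self, input_emb: torch.Tensor, target_positions: torch.LongTensor):
        # Couple content and target-position information before the backbone.
        return self.backbone(input_emb + self.pos_emb(target_positions))


class CPDBlock(nn.Module):
    """Content-Position Decoupling: a modular block placed on top of the AR
    model. The position-aware embedding is the query; the backbone's
    final-layer hidden states are the key and value."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, pos_query: torch.Tensor, hidden_states: torch.Tensor,
                attn_mask=None) -> torch.Tensor:
        # pos_query:      (batch, seq_len, d_model) position-aware embeddings (query)
        # hidden_states:  (batch, seq_len, d_model) final-layer AR states (key/value)
        attn_out, _ = self.cross_attn(self.ln1(pos_query), hidden_states,
                                      hidden_states, attn_mask=attn_mask)
        x = pos_query + attn_out
        return x + self.ffn(self.ln2(x))
```

In this reading, CPC modifies only the input embedding, so any off-the-shelf AR backbone can be reused without architectural changes, whereas CPD leaves the backbone untouched and attaches the position-aware block as a drop-in module over its final hidden states.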
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25482