Keywords: language modeling, next token prediction, multi-token prediction, auxiliary loss
TL;DR: We propose Token Order Prediction (TOP), a simpler and more effective auxiliary objective than Multi-Token Prediction (MTP) that improves language model performance by learning to rank future tokens by proximity rather than predicting them exactly.
Abstract: Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training, but it yields inconsistent gains and underperforms on standard NLP benchmarks. We find that MTP's exact prediction of future tokens is too difficult to serve as an effective auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer, compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using the NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP outperforms both NTP and MTP overall, even at scale. On the synthetic star graph task, TOP enables pathfinding on graphs where NTP and MTP fail.
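To make the mechanism concrete, below is a minimal sketch of how a TOP-style auxiliary head could be wired up: a single extra unembedding layer produces ranking scores over the vocabulary, and a learning-to-rank loss pushes tokens that appear sooner to rank higher. The abstract does not specify the exact ranking loss or window, so the function name, the window size, the graded proximity scores, and the ListNet-style soft cross-entropy used here are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def top_auxiliary_loss(hidden, top_unembed, input_ids, window=4):
    """
    hidden: (B, T, D) trunk hidden states shared with the NTP head.
    top_unembed: the single extra unembedding layer, e.g. nn.Linear(D, V, bias=False).
    input_ids: (B, T) token ids of the training sequence.
    """
    B, T = input_ids.shape
    V = top_unembed.out_features
    logits = top_unembed(hidden)                      # (B, T, V) ranking scores over the vocab

    # Graded target scores: a token appearing k steps ahead (k <= window) gets score
    # window - k + 1, so nearer tokens rank higher; all other tokens stay at -inf.
    targets = hidden.new_full((B, T, V), float("-inf"))
    batch_idx = torch.arange(B, device=hidden.device).unsqueeze(1)
    for k in range(window, 0, -1):                    # closest offset written last, wins ties
        pos = torch.arange(T - k, device=hidden.device).unsqueeze(0)
        targets[batch_idx, pos, input_ids[:, k:]] = float(window - k + 1)

    # ListNet-style soft cross-entropy between the graded target distribution and the
    # model's ranking distribution, over positions that have at least one future token.
    valid = torch.isfinite(targets).any(dim=-1)       # (B, T)
    p = F.softmax(targets[valid], dim=-1)
    log_q = F.log_softmax(logits[valid], dim=-1)
    return -(p * log_q).sum(dim=-1).mean()
```

In training, such a term would presumably be added to the standard NTP cross-entropy with some weighting coefficient (a hypothetical `loss = ntp_loss + lam * top_auxiliary_loss(...)`); the weighting is likewise an assumption here.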
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 17542