Keywords: language modeling, next token prediction, multi-token prediction, auxiliary loss
TL;DR: We propose Token Order Prediction (TOP), a simpler and more effective auxiliary objective than Multi-Token Prediction (MTP) that improves language model performance by learning to rank future tokens by proximity rather than predicting them exactly.
Abstract: Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training, but it yields inconsistent gains and underperforms on standard NLP benchmarks. We find that MTP's exact prediction of future tokens is too difficult to serve as an effective auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer, compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using the NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP outperforms both NTP and MTP overall, even at scale. On the synthetic star graph task, TOP enables pathfinding on graphs where NTP and MTP fail.
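To make the mechanism concrete, below is a minimal sketch of how a TOP-style auxiliary head could be wired up: a single extra unembedding layer produces ranking scores over the vocabulary, and a learning-to-rank loss pushes tokens that appear sooner to rank higher. The abstract does not specify the exact ranking loss or window, so the function name, the window size, the graded proximity scores, and the ListNet-style soft cross-entropy used here are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def top_auxiliary_loss(hidden, top_unembed, input_ids, window=4):
    """
    hidden: (B, T, D) trunk hidden states shared with the NTP head.
    top_unembed: the single extra unembedding layer, e.g. nn.Linear(D, V, bias=False).
    input_ids: (B, T) token ids of the training sequence.
    """
    B, T = input_ids.shape
    V = top_unembed.out_features
    logits = top_unembed(hidden)                      # (B, T, V) ranking scores over the vocab

    # Graded target scores: a token appearing k steps ahead (k <= window) gets score
    # window - k + 1, so nearer tokens rank higher; all other tokens stay at -inf.
    targets = hidden.new_full((B, T, V), float("-inf"))
    batch_idx = torch.arange(B, device=hidden.device).unsqueeze(1)
    for k in range(window, 0, -1):                    # closest offset written last, wins ties
        pos = torch.arange(T - k, device=hidden.device).unsqueeze(0)
        targets[batch_idx, pos, input_ids[:, k:]] = float(window - k + 1)

    # ListNet-style soft cross-entropy between the graded target distribution and the
    # model's ranking distribution, over positions that have at least one future token.
    valid = torch.isfinite(targets).any(dim=-1)       # (B, T)
    p = F.softmax(targets[valid], dim=-1)
    log_q = F.log_softmax(logits[valid], dim=-1)
    return -(p * log_q).sum(dim=-1).mean()
```

In training, such a term would presumably be added to the standard NTP cross-entropy with some weighting coefficient (a hypothetical `loss = ntp_loss + lam * top_auxiliary_loss(...)`); the weighting is likewise an assumption here.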
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 17542