Looking beyond the next token

ICLR 2026 Conference Submission 21187 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Data-centric, Next token prediction, Controllable generation
TL;DR: A data-centric, model-agnostic approach to better planning and controllable generation.
Abstract: The most natural way to model language is rarely autoregressive. Causal language model training assumes that each token can be predicted from prior context, which contrasts with humans' natural writing and reasoning, processes that are often non-linear and hierarchical. While this mismatch is well documented, the working assumption has been that architectural changes are needed to address it. We argue that by simply rearranging and modifying the training data, models can more accurately imitate some aspects of the true data-generating process without any changes to the architecture or training infrastructure. We introduce Trelawney, a purely data-centric method that modifies the training data by interleaving sequences with special lookahead tokens that contain future information. This simple data augmentation equips models to both condition on future goals and generate them. We present representative results on high-entropy tasks such as path planning, algorithmic reasoning, zebra puzzles, and controllable generation, demonstrating improved performance on tasks with branching paths or long-horizon planning. Finally, our method enables the generation of plausible long-term goals at no additional cost, potentially opening doors to new capabilities beyond the current language modeling paradigm.
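To make the data-centric idea concrete, the sketch below shows one plausible way to interleave a lookahead segment into a training sequence. The abstract does not specify the exact marker tokens, insertion positions, or which future span is copied, so the names `LOOKAHEAD_START`/`LOOKAHEAD_END`, the `future_span` fraction, and the splicing policy here are illustrative assumptions, not the paper's actual recipe.

```python
import random

# Hypothetical special markers; the paper's actual token names are not
# given in the abstract, so these are placeholders.
LOOKAHEAD_START = "<lookahead>"
LOOKAHEAD_END = "</lookahead>"

def add_lookahead(tokens, insert_pos=None, future_span=(0.6, 0.8), seed=None):
    """Interleave a copy of future tokens earlier in the sequence.

    tokens: list of token strings for one training example.
    insert_pos: index at which the lookahead segment is spliced in;
                chosen at random if None.
    future_span: (start, end) fractions of the sequence to copy as the
                 future "goal" segment (an assumed policy).
    """
    rng = random.Random(seed)
    n = len(tokens)
    lo, hi = int(n * future_span[0]), int(n * future_span[1])
    future = tokens[lo:hi]                      # the future information
    if insert_pos is None:
        insert_pos = rng.randint(1, max(1, lo - 1))
    lookahead = [LOOKAHEAD_START] + future + [LOOKAHEAD_END]
    # Splice the lookahead segment into the prefix; the rest of the
    # sequence is unchanged, so no architecture or training changes are needed.
    return tokens[:insert_pos] + lookahead + tokens[insert_pos:]

# Example: the model now sees a future goal before generating toward it.
seq = "A B C D E F G H I J".split()
print(add_lookahead(seq, insert_pos=2, seed=0))
# ['A', 'B', '<lookahead>', 'G', 'H', '</lookahead>', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
```

Because the augmentation lives entirely in the data, a standard causal language model trained on such sequences can learn both to condition on a provided future goal and, when no goal is given, to generate the lookahead segment itself.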
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 21187