Next-ToBE: Probabilistic Next Token-Bag Exploitation for Activating Anticipatory Capacity in LLMs

Published: 26 Jan 2026 · Last Modified: 26 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large language models, anticipatory capacity
TL;DR: Next-ToBE introduces a soft-target distribution that activates and refines anticipatory capacity in LLMs, improving reasoning performance by looking ahead beyond the immediate next token.
Abstract: Auto-regressive large language models (LLMs) exhibit a non-trivial capacity to "anticipate" long-range future tokens despite being trained to predict only one token at a time. Nevertheless, how to systematically profile, enhance, and leverage this capacity to improve LLM reasoning in practice remains unclear. In this paper, we propose **Next Token-Bag Exploitation (Next-ToBE)** to tackle this challenge. Next-ToBE quantifies an LLM's anticipatory capacity by measuring how well tokens in the future window are pre-captured by the model's current softmax probabilities. This capacity is strongly correlated with LLM generative quality but often suppressed by the rigid one-hot objective in next-token prediction. To address this, we replace the one-hot target vector in next-token prediction with a soft target distribution spanning additional future tokens. Specifically, the immediate next token retains the highest importance, while more distant "look-ahead tokens" are also included to enrich supervision, with their importance dynamically determined by temporal and semantic relevance patterns to inject forward-looking pressure. Moreover, the fitting process emphasizes the model's intrinsic anticipatory tendency, thus preserving the confidence and fidelity of the pre-trained model and improving training stability. Overall, Next-ToBE not only effectively activates LLM anticipatory capacity through fine-tuning, yielding notable gains in reasoning performance with higher memory and computational efficiency than multi-token prediction (MTP) baselines, but also shows great potential in the pretraining setting by successfully cultivating this capacity from scratch. These results highlight its value as an effective strategy for extending the prediction horizon of LLMs, enabling them to see further and reason better.
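The abstract does not specify the exact weighting scheme, so the following is only a minimal sketch of the core idea: replacing the one-hot next-token target with a soft distribution over a small future window. The function name `soft_target_bag`, the mixing weight `alpha`, and the exponential temporal `decay` are illustrative assumptions; the paper's method additionally uses semantic relevance, which is omitted here.

```python
def soft_target_bag(future_tokens, vocab_size, window=4, alpha=0.6, decay=0.5):
    """Sketch of a soft target distribution over the vocabulary.

    The immediate next token receives weight `alpha`; the remaining
    1 - alpha mass is spread over up to `window - 1` look-ahead tokens
    with exponential temporal decay (an assumed simplification).
    """
    target = [0.0] * vocab_size
    lookahead = future_tokens[1:window]
    if not lookahead:
        # No look-ahead tokens available: fall back to a one-hot target.
        target[future_tokens[0]] = 1.0
        return target
    # Unnormalized decay weights for the look-ahead tokens.
    raw = [decay ** k for k in range(len(lookahead))]
    total = sum(raw)
    target[future_tokens[0]] += alpha
    for tok, w in zip(lookahead, raw):
        # Repeated tokens in the window accumulate mass.
        target[tok] += (1.0 - alpha) * w / total
    return target
```

A distribution like this can be used directly as the target of a standard cross-entropy loss against the model's softmax output, in place of the usual one-hot label.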
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11498