Future-Gain Guided Test-Time Learning for Large Language Models

LangYu Bian; Jinwu Hu; Zitian Zhang; Dongjin Yang; Yufeng Wang; Qing Du; Qi Chen; Mingkui Tan

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0

TL;DR: This paper proposes a safer way for large language models to adapt at test time by learning from the most informative parts of their own outputs.

Abstract: Large language models (LLMs) inevitably encounter distribution shifts during real-world deployment, leading to performance degradation. Although test-time learning (TTL) adapts LLMs from unlabeled test streams, applying entropy minimization to autoregressive generation faces two challenges: (i) early decoding errors can steer later tokens off track, and updating on them can push the model further off course, and (ii) updates on unreliable tokens can amplify confidently wrong predictions and trigger model collapse. To address these challenges, we propose Future-Gain Guided Test-Time Learning (FG-TTL) for LLMs, which learns selectively from the model's own generations. Our key idea is to update only on tokens that reduce uncertainty in subsequent generation rather than tokens that are merely uncertain at the current step. Specifically, we develop a Future-Gain Guided Token Selection (FTS) strategy to decide where to learn. We introduce Future-Gain as a token-level metric for this purpose and update the model only on high-gain tokens, concentrating learning on informative positions and mitigating temporal error propagation. In addition, we design a Risk-Aware Adaptation (RAA) mechanism that controls how strongly to update by combining gain-based weighting with adaptive temperature scaling based on intrinsic uncertainty, suppressing unreliable gradients while enabling stronger learning on high-gain tokens. Experiments on six benchmarks with three LLM backbones show that FG-TTL achieves the best average performance.
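
The selection idea in the abstract can be illustrated with a small sketch. The abstract does not give the formula for Future-Gain, so the code below substitutes a hypothetical stand-in: the entropy at a token minus the mean entropy of the next few tokens. The function names, the `horizon` and `keep_ratio` parameters, and the gain-normalized weights are all assumptions made for illustration, not the authors' definitions, and the adaptive temperature scaling of RAA is left out.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def future_gain(entropies, horizon=2):
    """Hypothetical gain: entropy at step t minus the mean entropy
    of the following `horizon` steps (an assumed stand-in metric)."""
    gains = []
    for t, h in enumerate(entropies):
        future = entropies[t + 1 : t + 1 + horizon]
        # The final token has no future tokens, so it gets zero gain.
        gains.append(h - sum(future) / len(future) if future else 0.0)
    return gains

def select_high_gain(gains, keep_ratio=0.5):
    """Take the top `keep_ratio` fraction of positions by gain,
    then keep only those whose gain is strictly positive."""
    k = max(1, int(len(gains) * keep_ratio))
    ranked = sorted(range(len(gains)), key=lambda t: gains[t], reverse=True)
    return sorted(t for t in ranked[:k] if gains[t] > 0.0)

def weighted_entropy_loss(entropies, gains, selected):
    """Entropy objective over selected tokens, weighted by normalized gain."""
    total = sum(gains[t] for t in selected)
    if total <= 0.0:
        return 0.0  # nothing reliable to learn from in this sequence
    return sum(gains[t] / total * entropies[t] for t in selected)

# Toy per-step distributions over a two-token vocabulary.
step_probs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01], [0.5, 0.5]]
entropies = [entropy(p) for p in step_probs]
gains = future_gain(entropies)
selected = select_high_gain(gains)
loss = weighted_entropy_loss(entropies, gains, selected)
```

In this toy sequence only position 0 is selected: it is uncertain, but the tokens that follow it are predicted confidently. The last position is equally uncertain yet is skipped because nothing after it becomes easier to predict, which is the distinction the abstract draws between high-gain tokens and tokens that are merely uncertain.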

Lay Summary: Large language models can perform well on many tasks, but their performance may drop when they are used in new situations, such as unfamiliar domains, changing user needs, or new ways of asking questions. One possible solution is to let the model keep learning while it is being used, even without human-provided answers. However, this can be risky: if the model learns from unreliable parts of its own responses, it may reinforce its mistakes. This paper proposes FG-TTL, a safer way for large language models to learn during use. Instead of learning from every part of a generated response, the method focuses on the parts that make the following parts easier for the model to predict. It also learns more cautiously when the model appears unsure, reducing the chance of learning from unreliable signals. Experiments on several reasoning and domain-specific tasks show that this approach improves different large language models, helping them adapt more reliably to new situations.

Link To Code: https://github.com/BianLangyu/FG-TTL.git

Primary Area: Deep Learning->Large Language Models

Keywords: Test-Time Learning, Large Language Models

Submission Number: 3423
