Keywords: test-time learning, long context, stability-plasticity, retrieval, reasoning
TL;DR: We show that label-free test-time learning fails beyond 8K tokens, but a simple token-salience–weighted loss restores long-context performance at the cost of some forgetting.
Abstract: Foundation models must keep pace with a changing world, and test-time learning (TTL) promises fast, label-free updates that could make this possible. Our study shows (a) where that promise breaks and (b) how to rescue it, even in the challenging long-context setting. We further characterize the effect of TTL on pretrained capabilities from a continual learning perspective via the plasticity–retention trade-off (in our experiments, RULER for long-context plasticity and three standard LM downstream tasks for retention). We uncover a sharp pattern: TTL reliably helps at short contexts but stalls or reverses between 8k and 32k sequence lengths, while base knowledge is largely preserved. However, we observe a moderately strong correlation (0.77) between input perplexity and long-context plasticity. This ties test-time improvement on long contexts to a single, measurable quantity and suggests that the TTL objective could be key to moving the needle further and should not be discarded outright. Our method, which measures each token's relevance and weights the per-token losses accordingly, rescues TTL performance under longer, noisier contexts. This reframes negative TTL results not as failures of the overall approach but as failures of the assumption that all tokens contribute equally: when useful context is sparse, naive test-time updates cannot meaningfully improve the model. At the same time, our method decreases the stability of the model. This work contributes empirical results and a diagnostic that make these trade-offs evident, setting the stage for useful, frequent, and low-cost updates that keep models current without eroding base capabilities.
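The abstract describes weighting each token's loss by a relevance score during label-free test-time updates. Below is a minimal, illustrative sketch of such a step, assuming PyTorch and a Hugging Face-style causal LM; the `token_salience` function and `ttl_step` helper are hypothetical placeholders, not the authors' implementation, and the actual salience measure used in the paper is not specified here.

```python
# Sketch of one token-salience-weighted test-time learning (TTL) step.
# Assumption: `model(input_ids).logits` follows the Hugging Face causal-LM API.
import torch
import torch.nn.functional as F


def token_salience(input_ids: torch.Tensor) -> torch.Tensor:
    """Placeholder salience: uniform weights over tokens.
    Replace with a real relevance measure (the paper's choice is not shown here)."""
    return torch.ones_like(input_ids, dtype=torch.float)


def ttl_step(model, input_ids: torch.Tensor, optimizer) -> float:
    """One label-free TTL update: next-token prediction on the input context,
    with each token's loss scaled by its salience weight."""
    logits = model(input_ids).logits                   # (batch, seq, vocab)
    shift_logits = logits[:, :-1, :]                   # predict token t+1 from prefix
    shift_labels = input_ids[:, 1:]
    per_token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view_as(shift_labels)                            # (batch, seq - 1)

    weights = token_salience(input_ids)[:, 1:]         # align weights with targets
    weights = weights / weights.sum().clamp_min(1e-8)  # normalize to sum to 1
    loss = (per_token_loss * weights).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With uniform weights this reduces to a standard unsupervised next-token TTL update; a non-uniform salience score down-weights irrelevant tokens, which is the behavior the abstract credits with rescuing long-context performance.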
Serve As Reviewer: ~Matthew_Riemer1
Submission Number: 39