Let's (not) just put things in Context: Test-time Training for Long-context LLMs

Rachit Bansal; Aston Zhang; Rishabh Tiwari; Lovish Madaan; Sai Surya Duvvuri; Fnu Devvrit; David Brandfonbrener; David Alvarez-Melis; Prajjwal Bhargava; Mihir Kale; Samy Jelassi

Published: 26 Jan 2026, Last Modified: 11 Feb 2026. Venue: ICLR 2026 Poster. License: CC BY 4.0

Keywords: long-context language models, test-time training, inference-time scaling

TL;DR: We examine the limitations of vanilla inference-time scaling for long-context large language models and evaluate test-time training as a promising alternative.

Abstract: Advances in training and architectural design have enabled LLMs with million-token context windows, yet in practice these models often read far more than they can reliably use. While inference-time compute scaling—typically via “thinking tokens”—can help on short multi-step reasoning tasks, our controlled long-context experiments show rapidly diminishing returns that collapse as context grows. We trace this to score dilution in static self-attention and prove that, in such regimes, decoding more tokens cannot reliably recover buried evidence. We propose query-only test-time training (qTTT): a cache-preserving adaptation that performs a single prefill to fix keys/values and then applies a handful of gradient updates to the query projections. qTTT provably increases the target–distractor margin and, empirically, delivers consistent gains across model sizes and benchmarks. On Qwen3-4B, qTTT improves average accuracy by +12.6 and +14.1 absolute points on LongBench-v2 and ZeroSCROLLS, respectively. The practical takeaway is simple: for long contexts, spending a small inference-time budget on context-specific adaptation is a more effective use of compute than generating additional thinking tokens.
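
Illustrative sketch (not from the paper): the toy below shows the two mechanisms the abstract names, under loudly stated assumptions. It uses a single attention head in NumPy, random keys, and a supervised "target index" loss as a stand-in for whatever objective the paper actually optimizes. The names `qttt_toy`, `target_prob`, and `soft_margin`, and the soft-margin definition itself, are hypothetical. The first part shows score dilution: with the query held fixed, the attention mass on one target key shrinks as more distractor keys are added. The second part mimics the cache-preserving update: the keys stay frozen, as they would after a single prefill, and only the query projection receives a few gradient steps.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def target_prob(W_q, h, K, t):
    """Attention mass placed on key t by the query q = W_q @ h."""
    d = K.shape[1]
    return softmax(K @ (W_q @ h) / np.sqrt(d))[t]

def soft_margin(W_q, h, K, t):
    """Target logit minus log-sum-exp of all distractor logits.
    Hypothetical margin definition; it equals log(p_t / (1 - p_t))."""
    p = target_prob(W_q, h, K, t)
    return np.log(p) - np.log1p(-p)

def qttt_toy(W_q, h, K, t, steps=8):
    """A handful of gradient steps on the query projection only.
    K is read but never written, standing in for a frozen KV cache.
    Assumption: the loss is -log(attention on key t), a toy objective."""
    d = K.shape[1]
    # Conservative step size from a standard smoothness bound on this
    # toy loss, so every step lowers the loss.
    lr = 2.0 * d / (np.linalg.norm(K, 2) ** 2 * (h @ h))
    W = W_q.copy()
    for _ in range(steps):
        g_s = softmax(K @ (W @ h) / np.sqrt(d))
        g_s[t] -= 1.0                      # dL/ds = p - e_t
        g_q = K.T @ g_s / np.sqrt(d)       # dL/dq
        W -= lr * np.outer(g_q, h)         # dL/dW_q = outer(dL/dq, h)
    return W

rng = np.random.default_rng(0)
d, n_max, t = 32, 1024, 0                  # row 0 of `keys` is the target
keys = rng.standard_normal((n_max, d))
h = rng.standard_normal(d)                 # hidden state at the query position
W0 = rng.standard_normal((d, d)) / np.sqrt(d)

# Score dilution: same query, same target, growing context.
dilution = [target_prob(W0, h, keys[:n], t) for n in (64, 256, 1024)]

# Query-only adaptation at the longest context; keys are left untouched.
keys_before = keys.copy()
W1 = qttt_toy(W0, h, keys, t)
p_before, p_after = target_prob(W0, h, keys, t), target_prob(W1, h, keys, t)
m_before, m_after = soft_margin(W0, h, keys, t), soft_margin(W1, h, keys, t)

print("target attention vs. context length:", dilution)
print("target attention before/after update:", p_before, p_after)
print("soft margin before/after update:", m_before, m_after)
```

Because the step size is deliberately conservative, the gain per step in this toy is small. The point is only the direction of the effect: longer contexts dilute the target's attention mass, and updating the query projection against a fixed cache moves it back up without recomputing any keys or values.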

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 8016
