Keywords: long-context language models, test-time training, inference-time scaling
TL;DR: We study the limitations of vanilla inference-time scaling approaches for long-context large language models and examine test-time training as a promising alternative.
Abstract: Advances in training and architectural design have enabled LLMs with million-token context windows, yet in practice these models often read far more than they can reliably use. While inference-time compute scaling—typically via “thinking tokens”—can help on short multi-step reasoning tasks, our controlled long-context experiments show rapidly diminishing returns that collapse as context grows. We trace this to score dilution in static self-attention and prove that, in such regimes, decoding more tokens cannot reliably recover buried evidence. We propose query-only test-time training (qTTT): a cache-preserving adaptation that performs a single prefill to fix keys/values and then applies a handful of gradient updates to the query projections. qTTT provably increases the target–distractor margin and, empirically, delivers consistent gains across model sizes and benchmarks. On Qwen3-4B, qTTT improves average accuracy by +12.6 and +14.1 absolute points on LongBench-v2 and ZeroSCROLLS, respectively. The practical takeaway is simple: for long contexts, spending a small inference-time budget on context-specific adaptation is a more effective use of compute than generating additional thinking tokens.
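The abstract describes qTTT procedurally: one prefill freezes the KV cache, then a few gradient steps adapt only the query projections before decoding. Below is a minimal sketch of that loop, under stated assumptions: a Hugging Face-style causal LM whose query layers are named `q_proj` (true for Qwen-family models), a next-token language-modeling loss on the question tokens as the adaptation objective (the paper's actual loss is not given in the abstract), and illustrative hyperparameters. It is not the authors' implementation.

```python
# Hedged sketch of query-only test-time training (qTTT) as outlined in the
# abstract. Assumptions: "q_proj" layer naming, an LM loss on the question
# tokens as the adaptation objective, and the step/lr values shown here.
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def qttt_generate(model_name, context, question,
                  steps=5, lr=1e-4, max_new_tokens=64):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    # 1) Single prefill over the long context: keys/values are computed once
    #    with the original weights and frozen in the KV cache.
    ctx_ids = tok(context, return_tensors="pt").input_ids
    with torch.no_grad():
        cache = model(ctx_ids, use_cache=True).past_key_values

    # 2) Unfreeze only the query projections; everything else stays fixed,
    #    which is what keeps the cached keys/values valid.
    for name, p in model.named_parameters():
        p.requires_grad_("q_proj" in name)
    q_params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(q_params, lr=lr)

    # 3) A handful of gradient updates on the question tokens, conditioned on
    #    the frozen cache. HF caches are mutated in place, so each step works
    #    on a copy to keep the prefill cache pristine.
    q_ids = tok(question, return_tensors="pt").input_ids
    for _ in range(steps):
        step_cache = copy.deepcopy(cache)
        out = model(q_ids, past_key_values=step_cache, labels=q_ids)
        opt.zero_grad()
        out.loss.backward()
        opt.step()

    # 4) Decode the answer with the adapted query projections, reusing the
    #    same prefill cache rather than re-encoding the context.
    prompt = torch.cat([ctx_ids, q_ids], dim=1)
    with torch.no_grad():
        out_ids = model.generate(prompt, past_key_values=copy.deepcopy(cache),
                                 max_new_tokens=max_new_tokens)
    return tok.decode(out_ids[0, prompt.shape[1]:], skip_special_tokens=True)
```

The design choice the abstract emphasizes is cache preservation: because keys and values never change, the expensive prefill is paid exactly once, and the per-query adaptation cost is a few forward/backward passes over the short question rather than over the million-token context.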
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 8016