Self-Improving LLM Agents at Test-Time

ICLR 2026 Conference Submission 21886 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Test-Time Training, Self-Improvement, Language Agents
TL;DR: We introduce a test-time self-improvement algorithm in which agents detect uncertain test samples they struggle with, generate new examples from them, and use these for test-time fine-tuning, achieving higher accuracy with far fewer training samples.
Abstract: One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge the model has already acquired, resulting in unnecessary costs. In this paper, we explore a new test-time self-improvement method for creating more effective and generalizable agentic LMs *on-the-fly*. The proposed algorithm has three steps: (i) identify the samples the model struggles with using an uncertainty function (self-awareness), (ii) generate similar examples from the detected uncertain samples (self-data augmentation), and (iii) use these newly generated samples for test-time fine-tuning (self-learning). We study two variants of this approach: *Test-Time Self-Improvement* (TT-SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and contrast it with *Test-Time Distillation* (TT-D), where a stronger model generates similar examples for those same uncertain cases, enabling the student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI surpasses other standard learning methods with a +5.36% absolute gain in average accuracy while training on 68× fewer samples, and that TT-D further improves performance in harder scenarios that require diverse training signals. Our findings highlight both the promise of TT-SI and the cost and generalizability limitations of current learning frameworks, demonstrating the potential of self-evolving LMs at test time as a new paradigm for building agents that are more capable in complex scenarios.
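To make the three-step loop concrete, here is a minimal Python sketch of the test-time self-improvement procedure described in the abstract. The function names (`uncertainty`, `augment`, `finetune`) and the uncertainty threshold are illustrative assumptions, not the authors' actual implementation; in the TT-D variant, `augment` would call a stronger teacher model instead of the model being adapted.

```python
# Hypothetical sketch of the TT-SI loop: self-awareness -> self-data
# augmentation -> self-learning. All callables below are placeholders
# standing in for components the paper describes at a high level.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    completion: str = ""


def tt_si(
    model,                                               # the agentic LM being adapted
    test_samples: List[Sample],
    uncertainty: Callable[[object, Sample], float],      # self-awareness score
    augment: Callable[[object, Sample], List[Sample]],   # self-data augmentation
    finetune: Callable[[object, List[Sample]], object],  # lightweight test-time update
    threshold: float = 0.5,                              # assumed cutoff for "uncertain"
):
    """Adapt `model` on-the-fly, touching only the samples it is unsure about."""
    # (i) self-awareness: keep only samples whose uncertainty exceeds the cutoff
    uncertain = [s for s in test_samples if uncertainty(model, s) > threshold]

    # (ii) self-data augmentation: generate similar training examples from each
    # uncertain case (in TT-D, a stronger teacher model would generate these)
    synthetic = [x for s in uncertain for x in augment(model, s)]

    # (iii) self-learning: fine-tune on the synthetic set before answering
    return finetune(model, synthetic) if synthetic else model
```

Because only uncertain samples trigger augmentation and fine-tuning, the adaptation cost scales with how often the model is unsure rather than with the size of a pre-collected training set, which is consistent with the reported 68× reduction in training samples.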
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 21886