Test-Time Self-Distillation

Published: 01 Mar 2026, Last Modified: 07 Mar 2026, TTU at ICLR 2026 (Main) Oral, CC BY 4.0
Abstract: We study the discovery problem in difficult binary-reward tasks, where the goal is to find a solution in as few attempts as possible. Whereas prior approaches rely on repeatedly sampling from the base model while reflecting on past failures, we introduce a Test-Time Training (TTT) method that enables the model to continue learning at test time well beyond its limited context length. Previous applications of TTT to discovery problems have been limited to continuous rewards, since those allow hill-climbing on suboptimal solutions using scalar rewards. This fails in binary-reward tasks, where the reward only provides a learning signal once a solution has already been found. We introduce Test-Time Self-Distillation, which converts environment feedback on past failures into dense learning signals through Self-Distillation Policy Optimization (SDPO). SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. On difficult competitive programming problems from LiveCodeBench, test-time self-distillation achieves the same discovery probability as best-of-$k$ sampling or multi-turn conversations with $3\times$ fewer attempts.
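The core SDPO idea in the abstract, distilling the feedback-conditioned model's next-token distributions back into the unconditioned policy, can be sketched as a KL-matching loss. This is a minimal illustrative sketch, not the authors' implementation: the function names, toy logits, and averaging scheme are assumptions, and a real system would obtain both distributions from the same LLM with and without the feedback in its context.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sdpo_distillation_loss(policy_logits, teacher_logits):
    """Mean KL(teacher || policy) over token positions.

    policy_logits:  per-position next-token logits from the policy alone.
    teacher_logits: per-position logits from the same model conditioned on
                    environment feedback (the "self-teacher"). Minimizing
                    this loss pulls the policy toward the feedback-informed
                    predictions, giving a dense signal even when the task
                    reward itself is binary.
    """
    total = 0.0
    for p_log, t_log in zip(policy_logits, teacher_logits):
        p = softmax(p_log)
        t = softmax(t_log)
        total += sum(ti * (math.log(ti) - math.log(pi))
                     for ti, pi in zip(t, p) if ti > 0)
    return total / len(policy_logits)

# Toy example: two token positions, vocabulary of size 3 (values are made up).
policy_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
teacher_logits = [[1.8, 0.9, 0.2], [0.1, 2.0, 0.4]]  # feedback-conditioned
print(sdpo_distillation_loss(policy_logits, teacher_logits))
```

In a gradient-based setting this loss would be differentiated with respect to the policy's parameters only, with the teacher's feedback-conditioned distribution held fixed as the target.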
Submission Number: 37