Test-Time Self-Distillation

Published: 01 Mar 2026, Last Modified: 07 Mar 2026, TTU at ICLR 2026 (Main) Oral, CC BY 4.0
Abstract: We study the discovery problem in difficult binary-reward tasks, where the goal is to find a solution in as few attempts as possible. Whereas prior approaches rely on repeatedly sampling from the base model while reflecting on past failures, we introduce a Test-Time Training (TTT) method that enables the model to continue learning at test time well beyond its limited context length. Previous applications of TTT to discovery problems have been limited to continuous rewards, since those allow hill-climbing on suboptimal solutions using scalar rewards. This fails in binary-reward tasks, where the reward only provides a learning signal once a solution has already been found. We introduce Test-Time Self-Distillation, which converts environment feedback on past failures into dense learning signals through Self-Distillation Policy Optimization (SDPO). SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. On difficult competitive programming problems from LiveCodeBench, test-time self-distillation achieves the same discovery probability as best-of-$k$ sampling or multi-turn conversations with $3\times$ fewer attempts.
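The core SDPO idea in the abstract, distilling the feedback-conditioned model's next-token distributions back into the unconditioned policy, can be sketched as a KL-matching loss. This is a minimal illustrative sketch, not the authors' implementation: the function names, toy logits, and averaging scheme are assumptions, and a real system would obtain both distributions from the same LLM with and without the feedback in its context.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sdpo_distillation_loss(policy_logits, teacher_logits):
    """Mean KL(teacher || policy) over token positions.

    policy_logits:  per-position next-token logits from the policy alone.
    teacher_logits: per-position logits from the same model conditioned on
                    environment feedback (the "self-teacher"). Minimizing
                    this loss pulls the policy toward the feedback-informed
                    predictions, giving a dense signal even when the task
                    reward itself is binary.
    """
    total = 0.0
    for p_log, t_log in zip(policy_logits, teacher_logits):
        p = softmax(p_log)
        t = softmax(t_log)
        total += sum(ti * (math.log(ti) - math.log(pi))
                     for ti, pi in zip(t, p) if ti > 0)
    return total / len(policy_logits)

# Toy example: two token positions, vocabulary of size 3 (values are made up).
policy_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
teacher_logits = [[1.8, 0.9, 0.2], [0.1, 2.0, 0.4]]  # feedback-conditioned
print(sdpo_distillation_loss(policy_logits, teacher_logits))
```

In a gradient-based setting this loss would be differentiated with respect to the policy's parameters only, with the teacher's feedback-conditioned distribution held fixed as the target.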
Submission Number: 37