Keywords: idempotence; generalization
TL;DR: Adapting to distribution shifts at test time by training the model to be idempotent.
Abstract: This paper introduces Idempotent Test-Time Training (IT$^3$),
a novel approach to addressing the challenge of distribution shift.
While supervised-learning methods assume matching train and test distributions, this is rarely the case for machine learning systems deployed in the real world.
Test-Time Training (TTT) approaches address this by adapting models during inference, but they are limited by a domain-specific auxiliary task. IT$^3$ is based on the universal property of idempotence. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely $f(f(x))=f(x)$.
At training, the model receives an input $X$ along with another signal that can either be the ground truth label $y$ or a neutral "don't know" signal $\mathbf{0}$. At test time, the additional signal can only be $\mathbf{0}$. When sequentially applying the model, first predicting $y_0 = f(X, \mathbf{0})$ and then $y_1 = f(X, y_0)$, the distance between $y_0$ and $y_1$ measures certainty and indicates an out-of-distribution input $X$ when it is large.
We use this distance, which can be expressed as $\|f(X, f(X, \mathbf{0})) - f(X, \mathbf{0})\|$, as our TTT loss during inference. By carefully optimizing this objective, we effectively train $f(X,\cdot)$ to be idempotent, projecting the internal representation of the input onto the training distribution.
We demonstrate the versatility of our approach across various tasks,
including corrupted image classification, aerodynamic predictions,
tabular data with missing information, and large-scale aerial photo segmentation. Moreover, these tasks span different architectures such as MLPs, CNNs, and GNNs.
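To make the test-time objective concrete, the following is a minimal PyTorch sketch of the adaptation loop implied by the abstract. The toy conditioned MLP, the tensor shapes, the optimizer settings, and the choice to detach the first prediction are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of IT^3 test-time adaptation (illustrative, not the paper's code).
import torch
import torch.nn as nn

class ConditionedMLP(nn.Module):
    """Toy f(X, y): an MLP fed the input concatenated with a label signal."""
    def __init__(self, x_dim=16, y_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, y_dim),
        )

    def forward(self, x, y_signal):
        return self.net(torch.cat([x, y_signal], dim=-1))

def it3_adapt(model, x, steps=5, lr=1e-4):
    """Adapt on one test batch by minimizing the idempotence gap
    ||f(X, f(X, 0)) - f(X, 0)||, then return the final prediction."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    zero = torch.zeros(x.shape[0], 1)      # neutral "don't know" signal
    for _ in range(steps):
        y0 = model(x, zero)                # first pass:  f(X, 0)
        y1 = model(x, y0.detach())         # second pass: f(X, y0)
        # Detaching y0 (treating it as a fixed target) is one plausible way to
        # "carefully optimize" the objective; the paper may handle gradients differently.
        loss = (y1 - y0.detach()).norm()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(x, zero)

model = ConditionedMLP()
x_test = torch.randn(8, 16)                # a batch of (possibly shifted) test inputs
y_pred = it3_adapt(model, x_test)
```

Which terms are detached and how many gradient steps are taken per test input are exactly the "careful optimization" choices the abstract alludes to; this sketch fixes them arbitrarily for illustration.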
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4015