Abstract: Test-time training (TTT) through input perplexity minimization has emerged as a promising approach for enhancing language model performance during inference. However, questions remain about its practical robustness and applicability beyond popular benchmarks. This paper presents a preliminary analysis investigating two critical questions: whether TTT is effective on unseen tasks and how sensitive it is to hyperparameter choices. We evaluate TTT on three anti-memorization datasets—Memo-Trap, GSM-Symbolic, and Math-Perturb—using six models from the Qwen 2.5 and Llama 3 families. Our findings reveal that while TTT shows effectiveness on common benchmarks such as AIME 2024, it struggles with tasks designed to counter memorization, raising questions about whether the gains stem from domain adaptation or data contamination. We identify significant performance differences among optimizers, with SGD outperforming Adam despite slower convergence. Through extensive hyperparameter sweeps over learning rates, training steps, weight decay, momentum, and gradient normalization, we demonstrate that TTT is highly sensitive to these choices, with no universal recipe across tasks and models. Notably, gradient normalization emerges as an effective technique for improving robustness, mitigating catastrophic performance drops and reducing sensitivity to the learning rate. Our analysis also reveals that tuning feed-forward networks can achieve better peak performance than full-model tuning, while attention-only tuning provides more stable worst-case performance. These findings highlight the need for continued research into making test-time training more practical and reliable for real-world deployment. Because this work focuses on a single TTT algorithm, input perplexity minimization, our conclusions may not extend to all TTT algorithms. We call on the community to pay closer attention to TTT's sensitivity so that it becomes better suited for real-world applications.
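To make the studied setup concrete, the sketch below illustrates test-time training via input perplexity minimization: before generating an answer, the model takes a few gradient steps minimizing the causal language-modeling loss on the unlabeled test input itself. This is a minimal illustration, not the paper's exact recipe; the checkpoint name, hyperparameters, parameter subset (feed-forward only), and the particular gradient-normalization variant are illustrative assumptions.

```python
# Minimal sketch of TTT via input perplexity minimization (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"  # hypothetical pick from the Qwen 2.5 family
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.train()

# Tune only feed-forward (MLP) parameters, one of the configurations studied.
for p in model.parameters():
    p.requires_grad_(False)
params = [p for n, p in model.named_parameters() if "mlp" in n]
for p in params:
    p.requires_grad_(True)

# SGD is reported to outperform Adam; lr/steps here are illustrative, not the paper's values.
optimizer = torch.optim.SGD(params, lr=1e-4, momentum=0.9, weight_decay=0.0)

prompt = "If x + 3 = 7, what is x?"  # the (unlabeled) test input
inputs = tokenizer(prompt, return_tensors="pt")

for step in range(10):  # illustrative number of TTT steps
    # Causal LM loss on the input tokens = the input-perplexity objective.
    out = model(**inputs, labels=inputs["input_ids"])
    out.loss.backward()

    # Gradient normalization (one possible variant): rescale the global gradient
    # norm to 1 before the update to avoid catastrophic steps at large learning rates.
    total_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
    for p in params:
        p.grad.div_(total_norm + 1e-12)

    optimizer.step()
    optimizer.zero_grad()

model.eval()  # then generate the answer with the adapted weights
```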
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Chuxu_Zhang2
Submission Number: 6785