Test-Time Training Undermines Existing Safety Guardrails

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Trustworthy AI · CC BY 4.0
Keywords: Test-Time Training, Jailbreak, LLM attacks
Abstract: Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference. TTT has been shown to be useful in several scenarios, including few-shot learning, retrieval-augmented models, and improving performance on complex reasoning tasks. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models. In this work, we identify several new threat models for TTT and demonstrate how attackers can leverage these settings to bypass safety filters. Our results show that TTT consistently improves the Attack Success Rate (ASR). These findings suggest that TTT exposes a new attack surface, strengthens attacks, and undermines existing safety guardrails. Thus, we argue that establishing additional safety guidelines is essential for the secure deployment of TTT in real-world applications.
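The core mechanism the abstract describes, a model updating its own parameters on a self-supervised objective before answering a test query, can be illustrated with a minimal sketch. Everything below is hypothetical (the toy linear model, the reconstruction loss, and all names); it is not the paper's method, only an illustration of what "adapting parameters during inference" means, and hence why the inference-time weights can drift away from those the safety guardrails were tuned on.

```python
import numpy as np

# Minimal illustrative sketch of Test-Time Training (TTT): before
# predicting on a test input, the model takes a few gradient steps on a
# self-supervised objective (here, reconstructing the input itself).
# The toy linear model and all names are hypothetical.

rng = np.random.default_rng(0)

# Toy "model": shared encoder row W for the prediction y = W @ x,
# plus a decoder D used only by the self-supervised reconstruction head.
W = rng.normal(size=(1, 4))
D = rng.normal(size=(4, 1))

def ttt_predict(x, W, D, lr=0.01, steps=5):
    """Adapt a copy of W on a reconstruction loss for this one input, then predict."""
    W = W.copy()  # per-query adaptation; the stored weights are untouched
    for _ in range(steps):
        h = W @ x                        # shared features, shape (1,)
        x_hat = D @ h                    # reconstruction of the input, shape (4,)
        err = x_hat - x                  # reconstruction error
        # Gradient of 0.5 * ||D @ W @ x - x||^2 with respect to W
        grad_W = (D.T @ err)[:, None] @ x[None, :]
        W -= lr * grad_W                 # the test-time parameter update
    return (W @ x)[0]

x_test = rng.normal(size=4)
y_static = (W @ x_test)[0]               # prediction with frozen weights
y_ttt = ttt_predict(x_test, W, D)        # prediction after test-time updates
print(y_static, y_ttt)
```

The two printed values differ because the adapted copy of `W` has moved; in the threat models the paper studies, an adversary who controls the test-time inputs controls the direction of exactly this kind of update.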
Submission Number: 306