Keywords: llm security, jailbreak defense, test-time alignment, safety-aware post-processing, efficient tree search thresholding
Abstract: This paper investigates the security of large language models (LLMs) in extended reasoning, with a particular focus on mitigating vulnerabilities such as jailbreak attacks. Existing approaches generally modify model parameters during training to instill safe behaviors in LLMs. However, such methods remain susceptible to a variety of jailbreak attacks at test time and often perform poorly in security evaluations. To address these challenges, we propose Test-time Security Alignment with Dynamic Intervention (TRADE), a framework that directly mitigates jailbreak vulnerabilities during inference. Specifically, we introduce a reward-guided branch update module that advances the generation process using a multifurcation reward model, which evaluates multiple candidate tokens simultaneously. To further mitigate jailbreak attacks, we assess the final response with an additional safeguard model that enables safety-aware post-processing. If harmful content is detected, TRADE injects safety prompts and restarts the reward-guided generation phase with an efficient tree-search thresholding strategy. Extensive experiments on benchmark datasets demonstrate the effectiveness of TRADE compared to existing LLM reasoning methods under jailbreak attack scenarios. Our code is available at https://anonymous.4open.science/r/TRADE-4DB3.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9569
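The test-time loop described in the abstract (reward-guided branch selection, a safeguard check on the final response, and a safety-prompt injection plus restart gated by a threshold) can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the authors' released implementation: the stub functions `propose_candidates`, `reward_score`, and `safeguard_flags_harm`, the `SAFETY_PROMPT` string, and the `threshold` value all stand in for the base LLM, the learned multifurcation reward model, the safeguard model, and the paper's tree-search thresholding rule.

```python
# Toy sketch of a TRADE-style test-time loop, based only on the abstract.
# All names, scoring rules, and thresholds are hypothetical placeholders.
import random

random.seed(0)

VOCAB = ["Sure,", "I", "cannot", "help", "with", "that.", "Here", "is", "how", "to", "stay", "safe."]
SAFETY_PROMPT = "[SYSTEM] Respond helpfully but refuse harmful requests."  # hypothetical injected safety prompt

def propose_candidates(prefix, k=4):
    """Stand-in for the base LLM proposing k candidate next tokens (branches)."""
    return random.sample(VOCAB, k)

def reward_score(prefix, token):
    """Stand-in for the multifurcation reward model scoring each candidate branch.
    Here a toy heuristic favors refusal-style tokens; the real model is learned."""
    return 1.0 if token in {"cannot", "safe.", "I"} else random.random() * 0.5

def safeguard_flags_harm(response):
    """Stand-in for the safeguard model used in safety-aware post-processing."""
    return "Sure," in response  # toy rule: treat compliant-sounding openers as risky

def trade_generate(prompt, max_tokens=8, max_restarts=2, threshold=0.4):
    """Reward-guided generation with a safeguard check and safety-prompt restart.
    `threshold` mimics the tree-search thresholding step: low-reward branches are pruned."""
    prefix = prompt
    for _restart in range(max_restarts + 1):
        tokens = []
        for _ in range(max_tokens):
            candidates = propose_candidates(prefix + " " + " ".join(tokens))
            scored = [(reward_score(prefix, c), c) for c in candidates]
            kept = [sc for sc in scored if sc[0] >= threshold] or scored  # prune below-threshold branches
            tokens.append(max(kept)[1])  # advance the highest-reward surviving branch
        response = " ".join(tokens)
        if not safeguard_flags_harm(response):
            return response
        prefix = SAFETY_PROMPT + " " + prompt  # inject the safety prompt and restart generation
    return "I cannot help with that."  # fall back to a refusal if every restart is flagged

print(trade_generate("How do I pick a lock?"))
```

Running the script prints a single generated string; in the actual system the stubs would be replaced by the base LLM, the multifurcation reward model, and the safeguard model, and the greedy token choice would become the paper's branch-update and tree-search procedure.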