Keywords: llm security, jailbreak defense, test-time alignment, safety-aware post-processing, efficient tree search thresholding
Abstract: This paper investigates the security of large language models (LLMs) in extended reasoning, with a particular focus on mitigating vulnerabilities such as jailbreak attacks. Existing approaches generally modify model parameters during training to instill safe behaviors in LLMs. However, such methods remain susceptible to a variety of jailbreak attacks at test time and often perform poorly in security evaluations. To address these challenges, we propose Test-time Security Alignment with Dynamic Intervention (TRADE), a framework that directly mitigates jailbreak vulnerabilities during inference. Specifically, we introduce a reward-guided branch update module that advances the generation process using a multifurcation reward model, which evaluates multiple candidate tokens simultaneously. To further mitigate jailbreak attacks, we assess the final response with an additional safeguard model that enables safety-aware post-processing. If harmful content is detected, TRADE injects safety prompts and restarts the reward-guided generation phase with an efficient tree-search thresholding strategy. Extensive experiments on benchmark datasets demonstrate the effectiveness of TRADE compared to existing LLM reasoning methods under jailbreak attack scenarios. Our code is available at https://anonymous.4open.science/r/TRADE-4DB3.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9569
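The test-time loop described in the abstract (reward-guided branch selection, a safeguard check on the final response, and a safety-prompt injection plus restart gated by a threshold) can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the authors' released implementation: the stub functions `propose_candidates`, `reward_score`, and `safeguard_flags_harm`, the `SAFETY_PROMPT` string, and the `threshold` value all stand in for the base LLM, the learned multifurcation reward model, the safeguard model, and the paper's tree-search thresholding rule.

```python
# Toy sketch of a TRADE-style test-time loop, based only on the abstract.
# All names, scoring rules, and thresholds are hypothetical placeholders.
import random

random.seed(0)

VOCAB = ["Sure,", "I", "cannot", "help", "with", "that.", "Here", "is", "how", "to", "stay", "safe."]
SAFETY_PROMPT = "[SYSTEM] Respond helpfully but refuse harmful requests."  # hypothetical injected safety prompt

def propose_candidates(prefix, k=4):
    """Stand-in for the base LLM proposing k candidate next tokens (branches)."""
    return random.sample(VOCAB, k)

def reward_score(prefix, token):
    """Stand-in for the multifurcation reward model scoring each candidate branch.
    Here a toy heuristic favors refusal-style tokens; the real model is learned."""
    return 1.0 if token in {"cannot", "safe.", "I"} else random.random() * 0.5

def safeguard_flags_harm(response):
    """Stand-in for the safeguard model used in safety-aware post-processing."""
    return "Sure," in response  # toy rule: treat compliant-sounding openers as risky

def trade_generate(prompt, max_tokens=8, max_restarts=2, threshold=0.4):
    """Reward-guided generation with a safeguard check and safety-prompt restart.
    `threshold` mimics the tree-search thresholding step: low-reward branches are pruned."""
    prefix = prompt
    for _restart in range(max_restarts + 1):
        tokens = []
        for _ in range(max_tokens):
            candidates = propose_candidates(prefix + " " + " ".join(tokens))
            scored = [(reward_score(prefix, c), c) for c in candidates]
            kept = [sc for sc in scored if sc[0] >= threshold] or scored  # prune below-threshold branches
            tokens.append(max(kept)[1])  # advance the highest-reward surviving branch
        response = " ".join(tokens)
        if not safeguard_flags_harm(response):
            return response
        prefix = SAFETY_PROMPT + " " + prompt  # inject the safety prompt and restart generation
    return "I cannot help with that."  # fall back to a refusal if every restart is flagged

print(trade_generate("How do I pick a lock?"))
```

Running the script prints a single generated string; in the actual system the stubs would be replaced by the base LLM, the multifurcation reward model, and the safeguard model, and the greedy token choice would become the paper's branch-update and tree-search procedure.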