Keywords: Jailbreak Defense, Multimodal Large Language Model
Abstract: While (multimodal) large language models (LLMs) have attracted widespread attention due to their exceptional capabilities, they remain vulnerable to jailbreak attacks.
Various defense methods have been proposed to mitigate such attacks.
These methods typically build specific defense mechanisms into the model during training or deployment, aiming to strengthen the LLM's robustness in advance.
However, as new jailbreak attacks continue to emerge, defenses that rely on static resistance mechanisms are frequently bypassed at test time.
To address these limitations, we propose Test-Time IMmunization (TTIM), a defense framework that adaptively defends against diverse jailbreak attacks through a self-evolving mechanism during testing.
Specifically, TTIM first trains a gist token for efficient jailbreak detection, which is then used to identify jailbreak activity during inference.
When jailbreak attempts are detected, TTIM performs safety fine-tuning on the identified jailbreak instructions paired with refusal responses.
Furthermore, to prevent the parameter updates made during safety fine-tuning from degrading the detector, we decouple the fine-tuning process from the detection module.
Extensive experiments on both LLMs and multimodal LLMs demonstrate that, starting from unguarded models, TTIM effectively defends against diverse jailbreaks during testing with only a few jailbreak samples. Code is attached as supplementary material.
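A minimal Python sketch of the test-time loop described in the abstract, under stated assumptions: the names `detector.score`, `model.safety_finetune`, `model.generate`, the refusal string, and the `finetune_every` schedule are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a test-time immunization loop (not the authors' code).
from typing import Iterable, Iterator, List, Tuple

REFUSAL = "I'm sorry, but I can't help with that."   # assumed refusal template


def serve(model, detector, prompts: Iterable[str],
          finetune_every: int = 8) -> Iterator[str]:
    """Answer incoming prompts while immunizing the model against detected jailbreaks.

    `model` and `detector` are placeholders: `detector.score(prompt)` stands in for a
    gist-token-based jailbreak classifier, and `model.safety_finetune(...)` for an
    update that touches only the generator's weights, leaving the detector frozen so
    its accuracy is not degraded by the updates (the decoupling noted in the abstract).
    """
    buffer: List[Tuple[str, str]] = []               # (jailbreak instruction, refusal) pairs
    for prompt in prompts:
        if detector.score(prompt) > 0.5:             # jailbreak detected at inference time
            buffer.append((prompt, REFUSAL))
            yield REFUSAL                            # refuse immediately
            if len(buffer) % finetune_every == 0:    # periodically safety-fine-tune
                model.safety_finetune(buffer)        # uses the collected pairs
        else:
            yield model.generate(prompt)             # benign request: answer normally
```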
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8488