Keywords: Multi-Turn Jailbreak Defense, Large Language Model Safety, Bidirectional Intention Inference
Abstract: The remarkable capabilities of Large Language Models (LLMs) have raised significant safety concerns, particularly regarding “jailbreak” attacks. While current defense research focuses on single-turn attacks, multi-turn jailbreak attacks circumvent conventional safeguards via progressive intent concealment and tactical manipulation. To address this critical challenge, we propose the Bidirectional Intention Inference Defense (BIID). The method combines forward request-based intention inference with backward response-based intention retrospection to uncover concealed risks, effectively preventing harmful content generation. The proposed method undergoes systematic evaluation against 8 baselines across 2 LLMs and 2 safety benchmarks under 10 different attack methods. Experimental results demonstrate that the proposed method significantly reduces the Attack Success Rate (ASR), outperforming all 8 baselines while effectively maintaining practical utility. Notably, comparative experiments across 3 multi-turn safety datasets further validate our method's significant advantages over other defense approaches.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling: safety and alignment
Contribution Types: Model analysis & interpretability, Reproduction study, Data analysis
Languages Studied: English
Submission Number: 9415