Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

ACL ARR 2026 January Submission9415 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multi-Turn Jailbreak Defense, Large Language Model Safety, Bidirectional Intention Inference
Abstract: The remarkable capabilities of Large Language Models (LLMs) have raised significant safety concerns, particularly regarding "jailbreak" attacks. While current defense research focuses on single-turn attacks, multi-turn jailbreak attacks circumvent conventional safeguards via progressive intent concealment and tactical manipulation. To address this critical challenge, we propose the Bidirectional Intention Inference Defense (BIID). The method combines forward, request-based intention inference with backward, response-based intention retrospection to uncover concealed risks, effectively preventing harmful content generation. The proposed method undergoes systematic evaluation against 8 baselines across 2 LLMs and 2 safety benchmarks under 10 different attack methods. Experimental results demonstrate that the proposed method significantly reduces the Attack Success Rate (ASR), outperforming all 8 baselines while effectively maintaining practical utility. Notably, comparative experiments across 3 multi-turn safety datasets further validate our method's significant advantages over other defense approaches.
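To make the abstract's bidirectional idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: it stands in for the forward (request-based) and backward (response-based) intention checks with simple keyword heuristics, where in BIID both directions would presumably be LLM-based intention inference. All function names and cue lists here are assumptions for illustration only.

```python
# Hypothetical cue lists standing in for learned intention classifiers.
RISKY_REQUEST_CUES = {"bypass safeguards", "make a weapon", "exploit"}
RISKY_RESPONSE_CUES = {"step-by-step attack", "precursor chemicals"}


def forward_intention_risk(dialogue: list[str]) -> bool:
    """Forward pass: infer the user's underlying intent from the
    accumulated multi-turn requests, not just the latest turn."""
    text = " ".join(dialogue).lower()
    return any(cue in text for cue in RISKY_REQUEST_CUES)


def backward_intention_risk(candidate_response: str) -> bool:
    """Backward pass: retrospect on the drafted response and ask
    what intent it would actually serve if released."""
    text = candidate_response.lower()
    return any(cue in text for cue in RISKY_RESPONSE_CUES)


def guarded_reply(dialogue: list[str], candidate_response: str) -> str:
    """Release the response only if neither direction flags risk."""
    if forward_intention_risk(dialogue) or backward_intention_risk(candidate_response):
        return "I can't help with that."
    return candidate_response
```

The point of checking both directions is that a multi-turn attacker may keep each individual request benign (defeating the forward check alone) while steering the model toward a harmful completion, which the backward check on the drafted response can still catch.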
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling: safety and alignment
Contribution Types: Model analysis & interpretability, Reproduction study, Data analysis
Languages Studied: English
Submission Number: 9415