Keywords: Jailbreak, LLM Safety
Abstract: Large Language Models (LLMs) have been adopted in a wide range of areas for their excellent performance in text generation. However, jailbreak attacks can circumvent the safety mechanisms of LLMs and lead to the generation of harmful or policy-violating content. In this paper, we first examine the performance of eight jailbreak attacks against an LLM-based filter and find that attacks containing obvious anomalous patterns are easily detected, whereas those resembling normal requests are more likely to bypass the filter. Based on this finding, we conclude that being indistinguishable from benign requests is critical for a successful jailbreak. This is because LLMs are trained with the objective of assisting with benign requests, and rejecting requests indistinguishable from benign ones contradicts this objective. Considering that normal users often include detailed information when seeking help, we propose the \textbf{Detail Guidance Attack (DGA)}, which leverages the generation of details to imitate normal user patterns. We evaluate DGA on multiple LLMs across several datasets, and the results show that DGA achieves strong jailbreak performance, e.g., attack success rates above 95\% against GPT-4o, Gemini-2.5-flash, and Qwen-3 on MaliciousInstruct. Since we reveal that jailbreak requests indistinguishable from benign ones can elicit severely harmful content, we collect daily-life requests of this kind and conduct a user study to understand whether respondents expect LLMs to answer them. The survey results show that respondents failed to reach a consensus on any of the requests, which indicates the difficulty of resolving the safety-usability trade-off in practice and highlights a significant gap between LLM safety research and real-world use.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2027