Keywords: Jailbreak, LLM Safety
Abstract: Large Language Models (LLMs) have been adopted in a wide range of areas for their excellent performance in text generation. However, jailbreak attacks can circumvent the safety mechanisms of LLMs and lead to the generation of harmful or policy-violating content. In this paper, we first examine the performance of eight jailbreak attacks against an LLM-based filter and find that attacks containing obvious anomalous patterns are easily detected, whereas those resembling normal requests are more likely to bypass the filter. Based on this finding, we conclude that being indistinguishable from benign requests is critical for a successful jailbreak. This is because LLMs are trained with the objective of assisting with benign requests, and rejecting requests indistinguishable from benign ones contradicts this objective. Considering that normal users often include detailed information when seeking help, we propose the \textbf{Detail Guidance Attack (DGA)}, which leverages the generation of details to imitate normal user patterns. We evaluate DGA on multiple LLMs across several datasets, and the results show that DGA achieves strong jailbreak performance, e.g., attack success rates above 95\% against GPT-4o, Gemini-2.5-flash, and Qwen-3 on MaliciousInstruct. Since we reveal that jailbreak requests indistinguishable from benign ones can elicit severely harmful content, we collect daily-life requests of this kind and conduct a user study to understand whether respondents expect LLMs to answer them. The survey results show that respondents failed to reach a consensus on any of the requests, which indicates the difficulty of resolving the safety-usability trade-off in practice and highlights a significant gap between LLM safety research and real-world use.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2027