Keywords: Jailbreak Attack; Large Language Model
Abstract: Large language models (LLMs) are increasingly deployed in safety-critical domains, yet their alignment with ethical constraints remains fragile, particularly when prompts require structured reasoning. We uncover a vulnerability, which we term Reasoning Against Alignment, in which LLMs generate harmful content not through misunderstanding but as the logically coherent outcome of multi-step inference.
Through black-box and white-box analyses across both commercial and open-source LLMs, we show that logically reframed prompts cause models to prioritize internal coherence over moral safeguards. Token-level traces reveal that refusal signals diminish while harmful semantics gradually emerge, a process that is not captured by surface-level rejection metrics.
To study this vulnerability, we introduce Reasoning Logic Jailbreaking (ReLoK), a single-turn attack that reframes unsafe requests as abstract viewpoints and decomposes sensitive terms. We evaluate ReLoK on five representative LLMs (ChatGPT-4o, Gemini 2.0 Flash, Claude 3.7 Sonnet, DeepSeek-R1-671B, and QwQ-32B) using three jailbreak datasets. It achieves an average attack success rate of 97.9%, highlighting the practical severity and broad applicability of the vulnerability.
Our findings suggest that alignment strategies must address not only what LLMs output but also how they reason. We advocate for reasoning-aware safety mechanisms such as ethical inference supervision and trajectory-level risk detection. Our code and data are available at https://anonymous.4open.science/r/Reasoning-Against-Alignment-7FD4.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8860