Abstract: Large language models (LLMs) have undergone safety alignment efforts to mitigate harmful outputs. However, as LLMs become more sophisticated in their reasoning, this very capability may introduce new security risks. While traditional jailbreak attacks rely on single-step prompts, multi-turn jailbreak strategies that adapt dynamically to context remain underexplored. In this work, we introduce TRIAL (Trolley-problem Reasoning for Interactive Attack Logic), a framework that leverages LLMs' ethical reasoning to bypass their safeguards. TRIAL embeds adversarial goals within ethical dilemmas modeled on the trolley problem. TRIAL achieves high jailbreak success rates against both open- and closed-source models. Our findings underscore a fundamental limitation in AI safety: as models gain advanced reasoning abilities, the nature of their alignment may inadvertently allow more covert security vulnerabilities to be exploited. TRIAL highlights an urgent need to reevaluate safety alignment oversight strategies, as current safeguards may prove insufficient against context-aware adversarial attacks.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness, Language Modeling, Question Answering
Languages Studied: English
Submission Number: 5814