Value-Driven Jailbreak Attack Against Large Language Models

18 Sept 2025 (modified: 05 Jan 2026) | ICLR 2026 Conference Withdrawn Submission | Everyone | CC BY 4.0
Keywords: Jailbreak Attack, AI Safety, Large Language Models
TL;DR: We propose a novel value-driven black-box jailbreak attack against large language models that achieves a state-of-the-art attack success rate (ASR) on JailbreakBench and the AdvBench subset.
Abstract: In the real world, whether a task gets executed often depends on whether the executor recognizes its value. Inspired by this, we propose the value-driven jailbreak attack (VDJA), a simple yet effective black-box jailbreak method against large language models (LLMs). VDJA first exploits the tendency of LLMs to agree with humans, inducing them to affirm the moral value of a harmful task, and then instructs them to perform the task, thereby achieving a jailbreak. Extensive experiments on five state-of-the-art (SOTA) LLMs demonstrate the superiority of VDJA. With only a single query and without concealing the harmful instructions, VDJA achieves an average attack success rate (ASR) of 91.8% on JailbreakBench and 95.2% on the AdvBench subset. Remarkably, it reaches 100% ASR against some of these LLMs on the AdvBench subset, demonstrating SOTA jailbreak success rates and attack efficiency. Most importantly, our work reveals a novel vulnerability in the safety guardrails of LLMs, highlighting the urgent need to improve their robustness.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10199