Dynamic Target Attack

ICLR 2026 Conference Submission 437 Authors

01 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: large language model, jailbreak attack, adversarial attack
Abstract: Existing gradient-based jailbreak attacks typically optimize an adversarial suffix to induce a fixed affirmative response, e.g., "Sure, here is...". However, this fixed target usually resides in an extremely low-density region of a safety-aligned LLM's output distribution conditioned on diverse harmful inputs. Because of the substantial discrepancy between the target and the model's original output, existing attacks require numerous iterations to optimize the adversarial prompt and may still fail to elicit the low-probability target response from the target LLM. In this paper, we propose Dynamic Target Attack (DTA), a new jailbreak framework that uses the target LLM's own responses as targets for optimizing adversarial prompts. In each optimization round, DTA samples multiple candidate responses directly from the output distribution conditioned on the current prompt and selects the most harmful one as a temporary target for prompt optimization. In contrast to existing attacks, DTA significantly reduces the discrepancy between the target and the output distribution, substantially easing the search for an effective adversarial prompt. Extensive experiments demonstrate the effectiveness and efficiency of DTA: under the white-box setting, DTA needs only $200$ optimization iterations to achieve an average attack success rate (ASR) of over $87$% on recent safety-aligned LLMs, exceeding the state-of-the-art baselines by over $15$%. The time cost of DTA is 2$\thicksim$26 times lower than that of existing baselines. Under the black-box setting, DTA uses Llama-3-8B-Instruct as a surrogate model for target sampling and achieves an ASR of $85$% against the black-box target model Llama-3-70B-Instruct, exceeding its counterparts by over $25$%. All code and other materials are available here.
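The abstract describes an iterative loop: sample candidate responses from the target model, pick the most harmful one as a temporary target, then optimize the adversarial prompt toward it. The following is a minimal sketch of that loop, not the authors' implementation; the helpers `sample_responses`, `harmfulness_score`, and `optimize_suffix` are hypothetical placeholders for model sampling, a harmfulness judge, and a gradient-based suffix optimizer, respectively.

```python
# Illustrative sketch of the DTA loop described in the abstract (assumptions noted above).
from typing import Callable, List

def dynamic_target_attack(
    harmful_prompt: str,
    init_suffix: str,
    sample_responses: Callable[[str], List[str]],    # draws candidate responses from the target LLM
    harmfulness_score: Callable[[str], float],       # scores how harmful a candidate response is
    optimize_suffix: Callable[[str, str, str], str], # refines the suffix toward a given target response
    num_rounds: int = 10,
) -> str:
    """Re-target suffix optimization each round at the model's own most harmful sample."""
    suffix = init_suffix
    for _ in range(num_rounds):
        # 1) Sample several candidate responses conditioned on the current adversarial prompt.
        candidates = sample_responses(harmful_prompt + " " + suffix)
        # 2) Select the most harmful candidate as this round's temporary optimization target.
        target = max(candidates, key=harmfulness_score)
        # 3) Optimize the adversarial suffix to increase the likelihood of that target response.
        suffix = optimize_suffix(harmful_prompt, suffix, target)
    return suffix
```

Because each temporary target is drawn from the model's own conditional output distribution, the optimization gap it must close is much smaller than with a fixed affirmative prefix, which is the intuition behind the reported efficiency gains.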
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 437