DS-GCG: Enhancing LLM Jailbreaks with Token Suppression and Induction Dual-Strategy

Published: 2025 · Last Modified: 15 Jan 2026 · CSCWD 2025 · CC BY-SA 4.0
Abstract: In intelligent collaborative systems, Large Language Models (LLMs) play an increasingly significant role, making their security and privacy of paramount importance. Greedy Coordinate Gradient (GCG)-based adversarial approaches are a staple of red-team testing for circumventing the safety alignment of LLMs. Yet these methods suffer from convergence difficulties and pseudo-evasion, which impede attack efficacy. Our research indicates that the high likelihood of rejection tokens appearing in the first k positions of the generated text is a major contributor to adversarial failures, and that suppressing them can significantly improve attack success rates. Building on these insights, we present DS-GCG, an adversarial attack methodology that enhances GCG attack potency. It employs adjustable-position prefilling to suppress refusal responses and induce harmful outputs, coupled with a bidirectional greedy gradient search to swiftly identify adversarial suffixes. DS-GCG's universal-suffix approach not only mitigates refusals but also hastens convergence, offering an efficient and robust search strategy. Experimental results on widely used open-source LLMs with the AdvBench dataset confirm the state-of-the-art performance of DS-GCG.
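As a rough illustration of the dual strategy described in the abstract (not the authors' implementation, which is not shown on this page), the sketch below combines a standard GCG-style target loss with a penalty on the probability mass assigned to refusal tokens within the first k response positions. The function name, the refusal-token ids, and the weighting parameter lambda_sup are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def suppression_induction_loss(logits, target_ids, refusal_ids, k=5, lambda_sup=0.5):
    """Hypothetical combined objective for adversarial-suffix optimization.

    - Induction term: cross-entropy pushing the model toward an affirmative
      target continuation (the usual GCG-style target loss).
    - Suppression term: probability mass of refusal tokens (e.g. "Sorry",
      "cannot") within the first k generated positions, to be minimized.

    logits:      (seq_len, vocab_size) next-token logits over the response span
    target_ids:  (seq_len,) token ids of the desired affirmative continuation
    refusal_ids: list[int] token ids whose appearance signals a refusal
    """
    # Induction: make the affirmative target continuation likely.
    induce = F.cross_entropy(logits, target_ids)

    # Suppression: average refusal-token probability over the first k positions.
    probs = logits[:k].softmax(dim=-1)                # (k, vocab)
    refusal_mass = probs[:, refusal_ids].sum(dim=-1)  # (k,)
    suppress = refusal_mass.mean()

    return induce + lambda_sup * suppress


if __name__ == "__main__":
    vocab, seq_len = 1000, 12
    logits = torch.randn(seq_len, vocab, requires_grad=True)
    target_ids = torch.randint(0, vocab, (seq_len,))
    refusal_ids = [17, 42, 256]  # placeholder ids standing in for refusal tokens
    loss = suppression_induction_loss(logits, target_ids, refusal_ids, k=5)
    loss.backward()  # gradients of this loss would drive a coordinate-wise suffix search
    print(float(loss))
```

In a GCG-style loop, the gradient of such a loss with respect to the suffix token embeddings would rank candidate token swaps; the suppression term is what penalizes refusal tokens in the early positions, per the intuition stated in the abstract.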