Abstract: In intelligent collaborative systems, Large Language Models (LLMs) play an increasingly significant role, making their security and privacy of paramount importance. Greedy Coordinate Gradient (GCG)-based adversarial approaches are a staple of red-team testing for circumventing the safety alignment of LLMs. Yet these methods suffer from issues such as convergence difficulty and pseudo-evasion, which impede attack efficacy. Our research indicates that the high likelihood of rejection tokens appearing in the initial k positions of the generated text is a major cause of adversarial failures, and that suppressing them can significantly improve attack success rates. Building on these insights, we present DS-GCG, a novel adversarial attack methodology that enhances GCG attack potency. It employs adjustable-position prefilling to suppress refusal responses and elicit harmful outputs, together with a bidirectional greedy gradient search to quickly identify adversarial suffixes. DS-GCG's universal-suffix approach not only mitigates refusals but also accelerates convergence, offering an efficient and robust search strategy. Experiments on widely used open-source LLMs with the AdvBench dataset confirm the state-of-the-art performance of DS-GCG.
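To make the general shape of such an attack concrete, the following is a minimal, self-contained sketch of a GCG-style greedy coordinate search that adds a penalty on refusal tokens in the first k generated positions. Everything here is an illustrative assumption: the toy surrogate model (emb, W), the placeholder token sets (REFUSAL_IDS, TARGET_IDS), and the single-swap update rule are stand-ins, not the paper's DS-GCG implementation, and the adjustable-position prefilling and bidirectional search described in the abstract are not reproduced.

```python
# Hypothetical sketch of a GCG-style suffix search with a refusal-token
# penalty over the first K generated positions. The "model" below is a
# random linear surrogate, used only to keep the example runnable.
import torch

torch.manual_seed(0)

VOCAB, DIM, SUFFIX_LEN, K = 50, 16, 8, 4
REFUSAL_IDS = torch.tensor([1, 2, 3])        # stand-ins for refusal tokens ("I", "cannot", "sorry")
TARGET_IDS = torch.tensor([10, 11, 12, 13])  # stand-ins for the desired target continuation

emb = torch.randn(VOCAB, DIM)                # toy embedding table
W = torch.randn(K, DIM, VOCAB)               # toy "model": one logit head per generated position

def loss_fn(one_hot_suffix):
    """Surrogate loss: push target tokens into, and refusal tokens out of, the first K slots."""
    x = (one_hot_suffix @ emb).mean(0)               # (DIM,) pooled suffix representation
    logits = torch.einsum("d,kdv->kv", x, W)         # (K, VOCAB) logits for first K positions
    logp = torch.log_softmax(logits, dim=-1)
    target_loss = -logp[torch.arange(K), TARGET_IDS[:K]].mean()  # maximize target likelihood
    refusal_loss = logp[:, REFUSAL_IDS].exp().sum()              # suppress refusal probability mass
    return target_loss + refusal_loss

suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))
for step in range(100):
    one_hot = torch.nn.functional.one_hot(suffix, VOCAB).float().requires_grad_(True)
    loss = loss_fn(one_hot)
    loss.backward()
    grad = one_hot.grad                              # (SUFFIX_LEN, VOCAB) token-level gradients
    # Greedy coordinate step: propose top candidate substitutions per position by
    # negative gradient, then keep the single swap that lowers the true loss the most.
    best_loss, best_swap = loss.item(), None
    for pos in range(SUFFIX_LEN):
        for cand in torch.topk(-grad[pos], 4).indices:
            trial = suffix.clone()
            trial[pos] = cand
            trial_loss = loss_fn(torch.nn.functional.one_hot(trial, VOCAB).float()).item()
            if trial_loss < best_loss:
                best_loss, best_swap = trial_loss, (pos, cand.item())
    if best_swap is None:                            # no improving swap found: stop
        break
    suffix[best_swap[0]] = best_swap[1]

print("optimized suffix token ids:", suffix.tolist())
```

In an actual attack, the surrogate above would be replaced by the victim LLM's forward pass over prompt plus suffix, and the loss would be evaluated on real refusal and target token sequences.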