Keywords: open-source, LLM, safety alignment, prefilling attacks, exploit
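TL;DR: We show that the "deep" safety alignment of open-source LLMs can still be broken by a rank-assisted generalization of the prefilling attack (RAP), and we propose PRESTO, a defense that regularizes the attention placed on harmful prefill tokens.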
Abstract: Open-source Large Language Models (LLMs) play a critical role in the democratization of AI, yet their "open" nature introduces more avenues for malicious actors to misuse them for harmful purposes. A frustratingly easy but powerful technique known as the prefilling attack has been shown to effectively circumvent the safety alignment of frontier open-source LLMs by simply prefilling the assistant response with an affirmative prefix before decoding. A recent promising supervised fine-tuning defense proposes using a simple data augmentation scheme to achieve a "deep" safety alignment, allowing the model to generate natural language refusals immediately following harmful prefills. In this work, we show that a simple generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can effectively extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability "harmful" tokens from the top 20 predicted next tokens at each step (thus ignoring high-probability "refusal" tokens). We then propose a new perspective on achieving deep safety alignment by matching the token ranks in the underlying data augmentation target distribution (rather than just their probabilities), yielding a surprisingly simple approach to strengthening deep alignment we call PRefill attEntion STOpping (PRESTO) that regularizes the attention placed on harmful prefill tokens. PRESTO yields up to a 4.7x improvement in the mean StrongREJECT score under RAP attacks across three popular open-source LLMs. By achieving a stronger level of safety against practical and accessible attacks, our work paves a path towards safer open-source models.
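The abstract's distinction between a token's probability and its rank can be illustrated with a toy next-token distribution. The sketch below is only an illustration of that distinction, not the authors' method or code: the tokens, the probabilities, and the helper names `token_ranks` and `top_k` are all hypothetical. It shows that a token can carry very little probability mass and still sit well inside the top 20 ranks, which is presumably why matching probabilities alone may leave ranks unconstrained.

```python
def token_ranks(probs):
    """Map each token to its 1-based rank, ordered by descending probability."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return {tok: i + 1 for i, tok in enumerate(ordered)}


def top_k(probs, k):
    """Return the k highest-probability tokens (or all tokens if fewer than k)."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


# Hypothetical toy distribution: made-up tokens and made-up numbers.
# Most of the mass sits on refusal-style tokens; the rest is spread thinly.
toy = {"I": 0.90, "Sorry": 0.06, "Sure": 0.02, "Step": 0.015, "The": 0.005}

ranks = token_ranks(toy)
for tok in top_k(toy, 20):
    print(f"{tok!r}: p={toy[tok]:.3f}, rank={ranks[tok]}")
```

In this toy case "Sure" has only 2% of the probability mass, yet its rank is 3, so any procedure that inspects the top 20 candidates would still see it. A defense stated in terms of ranks would need to move such tokens out of the top of the ordering, not merely shrink their probability.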
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 515