ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts

ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts

ACL ARR 2025 May Submission5338 Authors

20 May 2025 (modified: 03 Jul 2025)ACL ARR 2025 May SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: Existing LLM red-teaming approaches prioritize high attack success rate, often resulting in high-perplexity prompts. This focus overlooks low-perplexity attacks that are more difficult to filter, more likely to arise during benign usage, and more impactful as negative downstream training examples. In response, we introduce ASTPrompter, a single-step optimization method that uses contrastive preference learning to train an attacker to maintain low perplexity while achieving a high attack success rate (ASR). ASTPrompter achieves an attack success rate $5.1$ times higher on Llama-8.1B while using inputs that are $2.1$ times more likely to occur according to the frozen LLM. Furthermore, our attack transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings. Lastly, by tuning a single hyperparameter in our method, we discover successful attack prefixes along an efficient frontier between ASR and perplexity, highlighting perplexity as a previously under-considered factor in red-teaming.

Paper Type: Long

Research Area: Language Modeling

Research Area Keywords: red teaming, prompting, safety and alignment

Contribution Types: Model analysis & interpretability, NLP engineering experiment

Languages Studied: English, Web Text

Submission Number: 5338

Loading