Keywords: Adversarial attack, Large Language Model, Jailbreak defense, Prompt optimization
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities but remain susceptible to adversarial attacks aimed at eliciting harmful responses. Current defense mechanisms, while partially effective, often compromise the models' performance on benign tasks. To address this issue, we propose a novel two-stage framework to enhance LLM robustness against adversarial attacks. First, we train a universal adversarial soft prompt designed to align the hidden representations of harmful inputs with those of benign ones, effectively emulating adversarial attack patterns. Second, we use this soft prompt to augment the training data, strengthening the model's capability to reject harmful content. Extensive evaluations empirically demonstrate that our method outperforms existing defenses such as Circuit Breaker (CB) and Deliberative Alignment SFT (DSFT), achieving an average defense rate of 96.4% on the Llama-3-8B-Instruct model, compared to 90.6% for CB and 87.9% for DSFT.
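As a concrete illustration of the first stage, the sketch below optimizes a universal soft prompt so that harmful inputs, once prefixed by it, produce hidden representations close to those of benign inputs. This is a minimal sketch under stated assumptions: the target model, the mean-pooled last-layer representation, the MSE alignment loss, the prompt length, and the optimizer settings are all illustrative choices, not the paper's exact recipe.

```python
# Sketch of stage one: train a universal soft prompt that pulls the hidden
# representations of harmful inputs toward those of benign inputs.
# All hyperparameters and the loss/representation choices are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed target model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.requires_grad_(False)  # the base model is frozen; only the soft prompt trains

n_virtual = 20  # number of virtual prompt tokens (assumption)
emb_dim = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(torch.randn(n_virtual, emb_dim) * 0.01)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def pooled_hidden(inputs_embeds, attn_mask):
    """Mean-pool the final hidden layer as a sentence representation (assumption)."""
    out = model(inputs_embeds=inputs_embeds, attention_mask=attn_mask,
                output_hidden_states=True)
    h = out.hidden_states[-1]
    return (h * attn_mask.unsqueeze(-1)).sum(1) / attn_mask.sum(1, keepdim=True)

def train_step(harmful_texts, benign_texts):
    emb_layer = model.get_input_embeddings()
    harm = tok(harmful_texts, return_tensors="pt", padding=True)
    ben = tok(benign_texts, return_tensors="pt", padding=True)
    harm_emb = emb_layer(harm.input_ids)
    # Prepend the universal soft prompt to every harmful example.
    sp = soft_prompt.unsqueeze(0).expand(harm_emb.size(0), -1, -1).to(harm_emb.dtype)
    x = torch.cat([sp, harm_emb], dim=1)
    mask = torch.cat(
        [torch.ones(harm_emb.size(0), n_virtual, dtype=harm.attention_mask.dtype),
         harm.attention_mask], dim=1)
    with torch.no_grad():  # benign representations serve as fixed targets
        target = pooled_hidden(emb_layer(ben.input_ids), ben.attention_mask)
    loss = torch.nn.functional.mse_loss(pooled_hidden(x, mask), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In the second stage, the trained soft prompt would be prepended to harmful examples to generate augmented (attack-like input, refusal) pairs for fine-tuning; that stage is standard supervised fine-tuning and is omitted here.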
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7301