Keywords: Adversarial attack, Large Language Model, Jailbreak defense, Prompt optimization
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities but remain susceptible to adversarial attacks aimed at eliciting harmful responses. Current defense mechanisms, while partially effective, often compromise the models' performance on benign tasks. To address this issue, we propose a novel two-stage framework to enhance LLM robustness against adversarial attacks. First, we train a universal adversarial soft prompt designed to align the hidden representations of harmful inputs with those of benign ones, effectively emulating adversarial attack patterns. Second, we use this soft prompt to augment the training data, strengthening the model's capability to reject harmful content. Extensive evaluations empirically demonstrate that our method outperforms existing defenses such as Circuit Breaker (CB) and Deliberative Alignment SFT (DSFT), achieving an average defense rate of 96.4% on the Llama-3-8B-Instruct model, compared to 90.6% for CB and 87.9% for DSFT.
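As a concrete illustration of the first stage, the sketch below optimizes a universal soft prompt so that harmful inputs, once prefixed by it, produce hidden representations close to those of benign inputs. This is a minimal sketch under stated assumptions: the target model, the mean-pooled last-layer representation, the MSE alignment loss, the prompt length, and the optimizer settings are all illustrative choices, not the paper's exact recipe.

```python
# Sketch of stage one: train a universal soft prompt that pulls the hidden
# representations of harmful inputs toward those of benign inputs.
# All hyperparameters and the loss/representation choices are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed target model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.requires_grad_(False)  # the base model is frozen; only the soft prompt trains

n_virtual = 20  # number of virtual prompt tokens (assumption)
emb_dim = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(torch.randn(n_virtual, emb_dim) * 0.01)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def pooled_hidden(inputs_embeds, attn_mask):
    """Mean-pool the final hidden layer as a sentence representation (assumption)."""
    out = model(inputs_embeds=inputs_embeds, attention_mask=attn_mask,
                output_hidden_states=True)
    h = out.hidden_states[-1]
    return (h * attn_mask.unsqueeze(-1)).sum(1) / attn_mask.sum(1, keepdim=True)

def train_step(harmful_texts, benign_texts):
    emb_layer = model.get_input_embeddings()
    harm = tok(harmful_texts, return_tensors="pt", padding=True)
    ben = tok(benign_texts, return_tensors="pt", padding=True)
    harm_emb = emb_layer(harm.input_ids)
    # Prepend the universal soft prompt to every harmful example.
    sp = soft_prompt.unsqueeze(0).expand(harm_emb.size(0), -1, -1).to(harm_emb.dtype)
    x = torch.cat([sp, harm_emb], dim=1)
    mask = torch.cat(
        [torch.ones(harm_emb.size(0), n_virtual, dtype=harm.attention_mask.dtype),
         harm.attention_mask], dim=1)
    with torch.no_grad():  # benign representations serve as fixed targets
        target = pooled_hidden(emb_layer(ben.input_ids), ben.attention_mask)
    loss = torch.nn.functional.mse_loss(pooled_hidden(x, mask), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In the second stage, the trained soft prompt would be prepended to harmful examples to generate augmented (attack-like input, refusal) pairs for fine-tuning; that stage is standard supervised fine-tuning and is omitted here.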
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7301