Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

TMLR Paper 4151 Authors

06 Feb 2025 (modified: 10 Apr 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Aligned large language models (LLMs) are vulnerable to jailbreaks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based attacks, no existing defense provides robustness against semantic attacks while avoiding unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SemanticSmooth, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SemanticSmooth achieves strong robustness against both manually constructed jailbreak prompts and automatic jailbreak attacks such as GCG, PAIR, and PromptRS, while maintaining strong nominal performance on standard LLM evaluation benchmarks such as AlpacaEval for instruction following and PiQA for question answering.
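
To make the aggregation idea in the abstract concrete, here is a minimal Python sketch of a smoothing-style defense: the model is queried on several semantically transformed copies of the prompt and the final response is chosen by majority vote over response labels. This is an illustrative sketch only, not the authors' implementation; the transformation stubs and the `generate`/`classify` callables are hypothetical placeholders.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

# Placeholder semantic transformations. A real system would use an LLM to
# paraphrase or summarize the prompt; these stubs just let the sketch run.
def paraphrase(p: str) -> str: return f"Rephrase, then answer: {p}"
def summarize(p: str) -> str: return f"Answer the gist of: {p}"
def synonym_swap(p: str) -> str: return p  # identity stub

def semantic_smooth(prompt: str,
                    generate: Callable[[str], str],   # hypothetical: prompt -> LLM response
                    classify: Callable[[str], str],   # hypothetical: response -> label, e.g. "refuse"/"comply"
                    n_copies: int = 10) -> str:
    """Query the model on n_copies semantically transformed copies of the
    prompt, then return a response consistent with the majority label."""
    transforms: List[Callable[[str], str]] = [paraphrase, summarize, synonym_swap]
    labeled: List[Tuple[str, str]] = []
    for _ in range(n_copies):
        t = random.choice(transforms)      # sample a random transformation
        response = generate(t(prompt))     # model output on the perturbed copy
        labeled.append((classify(response), response))
    majority = Counter(label for label, _ in labeled).most_common(1)[0][0]
    return next(r for label, r in labeled if label == majority)
```

The intuition, as stated in the abstract, is that a jailbreak string tuned against the exact wording of a prompt tends not to survive semantic transformation, so the majority of transformed copies elicit the model's aligned behavior.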
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lingpeng_Kong1
Submission Number: 4151