Abstract: Despite efforts to align large language models (LLMs) with human intentions, widely used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, an algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM offers improved robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM.
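The abstract's perturb-and-aggregate idea can be illustrated with a minimal sketch. The `query_llm` and `is_jailbroken` callables below are hypothetical stand-ins for the target model and the jailbreak check, and the random character-swap perturbation is only one of the character-level changes the abstract alludes to; see the linked repository for the actual implementation.

```python
import random
import string


def perturb(prompt: str, q: float) -> str:
    """Randomly swap a fraction q of the characters in the prompt."""
    chars = list(prompt)
    n_swaps = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swaps):
        chars[i] = random.choice(string.printable)
    return "".join(chars)


def smooth_llm(prompt, query_llm, is_jailbroken, n_copies=10, q=0.1):
    """Query the LLM on perturbed copies of the prompt and aggregate by majority vote."""
    responses = [query_llm(perturb(prompt, q)) for _ in range(n_copies)]
    votes = [is_jailbroken(r) for r in responses]
    majority_jailbroken = sum(votes) > len(votes) / 2
    # Return a response consistent with the majority decision.
    consistent = [r for r, v in zip(responses, votes) if v == majority_jailbroken]
    return random.choice(consistent)
```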
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have incorporated all post-rebuttal revisions, as well as the changes requested by the Action Editor.
Code: https://github.com/arobey1/smooth-llm
Assigned Action Editor: ~Jiangchao_Yao1
Submission Number: 3905