Single-pass detection of jailbreaking input in large language models

Published: 04 Mar 2024, Last Modified: 14 Apr 2024
Venue: SeT LLM @ ICLR 2024
License: CC BY 4.0
Keywords: Jailbreaking, jailbreaking defenses, trustworthy LLMs
Abstract: Recent advancements have exposed the vulnerability of aligned large language models (LLMs) to jailbreaking attacks, sparking a wave of research on post-hoc defense strategies. However, existing approaches often require either multiple queries to the model or additional auxiliary LLMs, which is time- and resource-consuming. To this end, we propose single-pass detection (SPD), a method for detecting jailbreaking inputs from the logit values of a single forward pass. On the open-source models Llama 2 and Vicuna, SPD achieves a higher attack detection rate and faster detection than existing defense mechanisms, with minimal misclassification of benign inputs. Finally, we demonstrate the efficacy of SPD even without access to the full logits, on both GPT-3.5 and GPT-4. We firmly believe that our proposed defense presents a promising approach to safeguarding LLMs against adversarial attacks.
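The abstract does not specify how SPD maps logits to a decision, only that detection uses the logit values from a single forward pass. The following is a minimal sketch of that idea, assuming the feature is the top-k next-token logits and the detector is a pre-fitted linear classifier; the model name, k, weights, and threshold are all hypothetical, not details from the paper.

```python
# Minimal sketch of single-pass, logit-based jailbreak detection.
# Assumptions (not from the paper): the feature is the top-k logits of the
# first response token, and the detector is a pre-fitted linear classifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # Vicuna is one of the open-source models evaluated

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def logit_features(prompt: str, k: int = 32) -> torch.Tensor:
    """One forward pass; return the top-k logits of the next-token distribution."""
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]   # next-token logits, shape (vocab_size,)
    return torch.topk(logits, k).values      # hypothetical feature choice

def is_jailbreak(prompt: str, w: torch.Tensor, b: float) -> bool:
    """Score the logit features with a pre-fitted linear detector (weights w, bias b)."""
    feats = logit_features(prompt).float()
    return (feats @ w + b).item() > 0.0      # decision threshold is an assumption
```

In such a setup, the linear weights would be fitted beforehand on logit features from known benign and jailbreaking prompts; the single forward pass is what would keep detection cheap relative to defenses that issue multiple requests or invoke auxiliary LLMs.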
Submission Number: 91