Single-pass Detection of Jailbreaking Input in Large Language Models

Leyla Naz Candogan; Yongtao Wu; Elias Abad Rocamora; Grigorios Chrysos; Volkan Cevher

Single-pass Detection of Jailbreaking Input in Large Language Models

Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, Volkan Cevher

Published: 20 Feb 2025, Last Modified: 20 Feb 2025Accepted by TMLREveryoneRevisionsBibTeXCC BY 4.0

Abstract: Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called SPD, leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just a forward pass. SPD can not only detect attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.

Submission Length: Regular submission (no more than 12 pages of main content)

Supplementary Material: zip

Assigned Action Editor: ~Jonathan_Ullman1

Submission Number: 3682

Loading