Abstract: Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called SPD, leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just a forward pass.SPD can not only detect attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jonathan_Ullman1
Submission Number: 3682
Loading