EBGCG: Effective White-Box Jailbreak Attack Against Large Language Models

ACL ARR 2024 June Submission3644 Authors

16 Jun 2024 (modified: 02 Aug 2024) · CC BY 4.0
Abstract: Large Language Models (LLMs) excel at tasks such as question answering and text summarization, but they are vulnerable to jailbreak attacks that trick them into generating illicit content. Current black-box methods are inefficient, while white-box methods suffer from slow convergence and suboptimal results. We propose EBGCG, **E**mbedding space pre-optimization and **B**eam search-enhanced **G**reedy **C**oordinate **G**radient, a novel two-stage white-box jailbreak attack. The first stage uses gradient descent to pre-optimize adversarial suffixes in the continuous embedding space. The second stage employs a beam search-enhanced Greedy Coordinate Gradient search that weights tokens by position to reduce distraction. Our evaluation shows that EBGCG achieves an average attack success rate (ASR) of 69.47%, outperforming GCG and BEAST by 16.68% and 43.65%, respectively, and reaching up to 87.12% ASR on Falcon-7B-instruct.
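
The abstract gives only a high-level description; a minimal PyTorch sketch of the two stages might look like the following. The toy surrogate loss, the nearest-neighbor projection, the linear position weights, and the beam/top-k hyperparameters are all illustrative assumptions, not the authors' implementation, which would instead score a frozen LLM's affirmative completion of the harmful request.

```python
# Hedged sketch of the two-stage EBGCG idea from the abstract.
# The loss below is a toy stand-in; the real attack runs the frozen LLM.
import torch

torch.manual_seed(0)
vocab_size, dim, suffix_len = 1000, 64, 8
embed = torch.randn(vocab_size, dim)   # stand-in for the LLM's token embedding matrix
target = torch.randn(dim)              # stand-in for the "affirmative response" objective

def loss_fn(suffix_embeds):
    # Toy surrogate for the LM loss on the target completion.
    return -(suffix_embeds.mean(0) @ target)

# ---- Stage 1: pre-optimize the suffix in continuous embedding space ----
soft_suffix = torch.randn(suffix_len, dim, requires_grad=True)
opt = torch.optim.Adam([soft_suffix], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss_fn(soft_suffix).backward()
    opt.step()

# Project each optimized position onto its nearest vocabulary embedding,
# giving a discrete suffix as a warm start for the token-level search.
dists = torch.cdist(soft_suffix.detach(), embed)  # (suffix_len, vocab)
suffix_ids = dists.argmin(dim=1)

# ---- Stage 2: beam search-enhanced coordinate updates ----
# Assumed weighting scheme: down-weight later positions, per the abstract's
# "weights tokens by position to reduce distraction".
pos_weight = torch.linspace(1.0, 0.5, suffix_len)

def seq_loss(ids):
    return loss_fn(embed[ids]).item()

beam = [(seq_loss(suffix_ids), suffix_ids)]
beam_width, steps, topk = 4, 20, 16
for _ in range(steps):
    candidates = []
    for score, ids in beam:
        # Gradient w.r.t. one-hot tokens gives per-position substitution scores,
        # as in GCG; here one random coordinate per step keeps the sketch short.
        one_hot = torch.nn.functional.one_hot(ids, vocab_size).float().requires_grad_(True)
        loss = loss_fn(one_hot @ embed)
        loss.backward()
        grad = one_hot.grad * pos_weight[:, None]  # apply position weighting
        pos = torch.randint(suffix_len, (1,)).item()
        for tok in (-grad[pos]).topk(topk).indices:
            new_ids = ids.clone()
            new_ids[pos] = tok
            candidates.append((seq_loss(new_ids), new_ids))
    # Keep the best few candidates rather than a single greedy pick.
    beam = sorted(candidates, key=lambda t: t[0])[:beam_width]

best_loss, best_ids = beam[0]
print("best surrogate loss:", best_loss)
```

The stage-1 warm start is what distinguishes this flow from plain GCG: the discrete search begins from a suffix already aligned with the objective rather than from arbitrary tokens, which is consistent with the abstract's claim of faster convergence.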
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3644