CLAS-2024: The Evolution and Tactics of Jailbreaking Attacks

30 Oct 2024 (modified: 05 Nov 2024) · THU 2024 Fall AML Submission · CC BY 4.0
Keywords: jailbreak, safety, alignment, fine-tuning
Abstract: This proposal investigates jailbreak attack methods to identify vulnerabilities in Large Language Models (LLMs). As LLMs become essential in sectors such as healthcare, finance, and education, securing them against malicious exploitation is critical. Our project, part of the CLAS-2024 Jailbreaking Track at NeurIPS 2024, examines both white-box and black-box attack strategies on models including Llama-3 8B, GLM, and Qwen. We introduce two attack methods: a fine-tuning approach that automates adversarial prompt generation, and an accuracy-selective method that identifies the top-performing prompts. By simulating black-box environments and optimizing prompt-injection techniques, we aim to maximize the effectiveness of jailbreak attacks and offer insights for improving LLM safety.
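
The abstract does not specify how the accuracy-selective method is implemented; the sketch below shows one plausible reading, assuming it means scoring each candidate jailbreak prompt by its empirical attack success rate (ASR) across a set of target models and keeping the top-k performers. All names here (`query` callables, `judge_is_harmful`, `select_top_prompts`) are hypothetical stand-ins, not the authors' actual implementation.

```python
# Hypothetical sketch of accuracy-based prompt selection: rank candidate
# adversarial prompts by the fraction of target models they jailbreak.
from typing import Callable, List, Tuple

def attack_success_rate(
    prompt: str,
    targets: List[Callable[[str], str]],      # black-box callables: prompt -> model response
    judge_is_harmful: Callable[[str], bool],  # judge: does the response comply with the harmful request?
) -> float:
    """Fraction of target models whose response is judged a successful jailbreak."""
    hits = sum(judge_is_harmful(model(prompt)) for model in targets)
    return hits / len(targets)

def select_top_prompts(
    candidates: List[str],
    targets: List[Callable[[str], str]],
    judge_is_harmful: Callable[[str], bool],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every candidate prompt by ASR and return the k best with their scores."""
    scored = [(p, attack_success_rate(p, targets, judge_is_harmful)) for p in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In practice the judge would itself be an LLM grader or a refusal/compliance classifier; it is left abstract here since the proposal does not describe it.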
Submission Number: 38