SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization

Published: 09 Oct 2024, Last Modified: 03 Jan 2025. Red Teaming GenAI Workshop @ NeurIPS'24 Poster. License: CC BY 4.0
Keywords: AI Safety; LLM Red Teaming; Optimization; LLM Jailbreak
Abstract: Large Language Models (LLMs) have become increasingly impactful across various domains, including coding and data analysis. However, their widespread adoption has raised concerns about misuse, particularly in generating harmful or unethical content. Optimization-based jailbreaking techniques, a key component of LLM red teaming, aim to expose LLM vulnerabilities by inserting optimized adversarial triggers into prompts to elicit harmful outputs. Despite their potential, existing methods suffer from ineffectiveness and inefficiency due to the gap between gradient-based candidate ranking and the discrete trigger update. In this paper, we present SkewAct, a novel optimization-based jailbreak framework designed to improve both the efficacy and efficiency of adversarial prompt generation for better LLM red teaming. By utilizing gradients from both the original and an activation-perturbed target model (referred to as the skewed model), SkewAct identifies candidates that point toward the minima of wide convex regions of the loss landscape. This prevents the optimization from bouncing between multiple local minima (i.e., gradient overshooting). Experimental results show that SkewAct improves the Attack Success Rate (ASR) by over 10% and reduces the converged loss by more than 12%, consistently outperforming GCG across seven LLMs of varying safety levels, model architectures, and sizes.
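The abstract describes SkewAct only at a high level. The sketch below illustrates how a combined-gradient candidate selection step could look in a GCG-style loop, assuming a HuggingFace-style causal LM that accepts `inputs_embeds`. All names (`token_gradients`, `add_activation_skew`, `skewact_candidates`) and hyperparameters (`sigma`, `alpha`, `k`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of SkewAct-style candidate selection; the paper's
# exact procedure may differ.
import torch
import torch.nn.functional as F

def token_gradients(model, embed_w, trigger_ids, ctx_embeds, target_ids):
    """Gradient of the target loss w.r.t. a one-hot relaxation of the trigger,
    as in GCG. Shapes: embed_w [V, D], trigger_ids [L], ctx_embeds [1, C, D]."""
    one_hot = F.one_hot(trigger_ids, embed_w.size(0)).to(embed_w.dtype)
    one_hot.requires_grad_(True)
    trig_embeds = (one_hot @ embed_w).unsqueeze(0)        # differentiable lookup
    tgt_embeds = embed_w[target_ids].unsqueeze(0)
    inputs = torch.cat([ctx_embeds, trig_embeds, tgt_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    n = target_ids.size(0)
    loss = F.cross_entropy(logits[0, -n - 1:-1], target_ids)  # predict the target
    loss.backward()
    return one_hot.grad                                   # [L, V]

def add_activation_skew(model, sigma=0.02):
    """One plausible reading of the 'skewed model': forward hooks that add
    Gaussian noise to intermediate activations. Returns the hook handles so
    the perturbation can be removed afterwards."""
    def hook(_module, _inputs, output):
        return output + sigma * torch.randn_like(output)
    return [m.register_forward_hook(hook)
            for m in model.modules() if isinstance(m, torch.nn.Linear)]

def skewact_candidates(grad_orig, grad_skew, k=256, alpha=0.5):
    """Rank token substitutions by a convex combination of the two gradients,
    favoring descent directions shared by the original and skewed landscapes."""
    combined = alpha * grad_orig + (1.0 - alpha) * grad_skew
    return (-combined).topk(k, dim=-1).indices            # [L, k] candidate ids
```

In a GCG-style loop, one would compute `grad_orig` on the target model (passing a detached embedding matrix, e.g. `model.get_input_embeddings().weight.detach()`), attach the skew hooks, compute `grad_skew`, remove the hooks, and then evaluate random single-token swaps drawn from `skewact_candidates`, keeping the swap with the lowest loss.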
Serve As Reviewer: guo778@purdue.edu
Submission Number: 45