Keywords: Data Mining, LLMs, Jailbreaking
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks, yet they remain susceptible to jailbreak attacks, in which adversarial prompts are crafted to bypass the models' safety mechanisms and elicit harmful outputs. Traditional jailbreaking methods often rely on static templates or inefficient search processes, resulting in limited diversity and effectiveness. In this paper, we introduce ExpeAttack, a novel framework designed to enhance the efficiency and diversity of jailbreak prompts through a dynamic, experience-driven approach. ExpeAttack operates in two stages: seed generation and iterative refinement. In the seed generation stage, a Pattern Factory creates diverse initial prompts by integrating various attack strategies, such as role-playing and semantic inversion. The refinement stage combines short-term and long-term memory pools with an insight-based memory compression mechanism to distill successful attack patterns into transferable meta-instructions, enabling efficient and interpretable refinement of attack samples. Our experiments across multiple LLMs demonstrate that ExpeAttack achieves high attack success rates while maintaining computational efficiency and generating diverse jailbreak prompts. This work not only highlights the vulnerabilities of current LLMs but also provides insights into developing more robust and secure AI systems.
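The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the seed-generation and refinement loop only; every name here (`pattern_factory`, `compress_insights`, `refine`, the toy judge) is an illustrative assumption, not the authors' actual implementation, and the scoring function stands in for a real attack-success evaluator.

```python
# Hypothetical sketch of the ExpeAttack two-stage loop from the abstract.
# All function names and the judge are illustrative assumptions.

STRATEGIES = ["role-playing", "semantic inversion"]  # strategies named in the abstract

def pattern_factory(goal, n_seeds=4):
    """Stage 1: combine attack strategies into diverse seed prompts."""
    return [f"[{STRATEGIES[i % len(STRATEGIES)]}] {goal}" for i in range(n_seeds)]

def compress_insights(long_term_memory, cap=3):
    """Distill successful attack patterns into transferable meta-instructions."""
    strategies = [p.split("]")[0].strip("[") for p, _ in long_term_memory]
    # Keep the most frequently successful strategies as meta-instructions.
    ranked = sorted(set(strategies), key=strategies.count, reverse=True)
    return [f"Prefer strategy: {s}" for s in ranked[:cap]]

def refine(seeds, judge, rounds=3):
    """Stage 2: iterative refinement with short- and long-term memory pools."""
    short_term, long_term = [], []
    for _ in range(rounds):
        for prompt in seeds:
            score = judge(prompt)
            short_term.append((prompt, score))
            if score > 0.5:           # treat as a successful attack sample
                long_term.append((prompt, score))
        meta = compress_insights(long_term)
        # Rewrite the working set under the distilled meta-instructions.
        seeds = [f"{p} | {'; '.join(meta)}" if meta else p
                 for p, _ in short_term[-len(seeds):]]
        short_term.clear()
    return seeds, compress_insights(long_term)

# Toy judge standing in for an attack-success classifier.
toy_judge = lambda p: 0.9 if "role-playing" in p else 0.2
prompts, insights = refine(pattern_factory("test goal"), toy_judge)
```

The key design idea the sketch mirrors is that the long-term pool is never fed back verbatim: it is compressed into short meta-instructions, which keeps the refinement step cheap and makes the distilled attack knowledge interpretable and transferable across targets.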
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12792