Abstract: The widespread use of large language models (LLMs) has introduced security risks, including bias, discrimination, and other ethical concerns. Reinforcement Learning from Human Feedback (RLHF), a method for improving model security, still faces challenges such as competing objectives and misaligned generalization, which leave models vulnerable to jailbreak attacks. Existing methods mount jailbreak attacks by optimizing adversarial prompts or by exploiting the in-context learning capabilities of LLMs, but they are limited in efficiency and scalability. This paper proposes a reinforcement learning-based few-shot example selection method to improve the effectiveness and efficiency of such attacks. The proposed method extends the GPT-2 architecture with an example selection module and employs experience replay and an entropy penalty to accelerate convergence and avoid local optima. Experimental results show that, compared with existing methods, the approach achieves a 100% increase in attack success rate on Vicuna-7B and reduces the time to generate each harmful instruction on GPT-3.5 by 2.4 seconds.
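The abstract does not give implementation details, but the core idea (a policy that selects few-shot examples to build an attack prompt, trained with experience replay and an entropy penalty) can be illustrated with a minimal sketch. Everything below is assumed for illustration: the `ExampleSelector` network, the placeholder `attack_reward` signal, the candidate pool size, and the hyperparameters are not from the paper, and the paper's GPT-2-based selection module and target-model querying are replaced by toy stand-ins.

```python
# Illustrative sketch (not the authors' code): a policy scores a pool of
# candidate few-shot examples, samples k of them to form an attack prompt,
# and is updated with a REINFORCE-style loss plus an entropy bonus, while
# past (selection, reward) pairs are replayed from a small buffer.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

POOL_SIZE = 32      # number of candidate few-shot examples (assumed)
K_SHOTS = 4         # examples selected per attack prompt (assumed)
ENTROPY_COEF = 0.01 # weight of the entropy bonus (assumed)
REPLAY_CAPACITY = 256
BATCH_SIZE = 16


class ExampleSelector(nn.Module):
    """Scores each candidate example; selection probabilities via softmax."""

    def __init__(self, pool_size: int, hidden: int = 64):
        super().__init__()
        # One learnable embedding per candidate example (a stand-in for
        # features a GPT-2-based encoder would produce in the paper's setup).
        self.embeddings = nn.Parameter(torch.randn(pool_size, hidden))
        self.scorer = nn.Linear(hidden, 1)

    def forward(self) -> torch.Tensor:
        return self.scorer(self.embeddings).squeeze(-1)  # (pool_size,)


def attack_reward(selected_ids: list[int]) -> float:
    """Placeholder reward: in the real setting this would be 1.0 if the
    prompt built from the selected examples jailbreaks the target model,
    else 0.0. Here it is replaced by a toy deterministic signal."""
    return float(sum(selected_ids) % 3 == 0)


def train(steps: int = 200) -> ExampleSelector:
    policy = ExampleSelector(POOL_SIZE)
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
    replay: deque[tuple[list[int], float]] = deque(maxlen=REPLAY_CAPACITY)

    for _ in range(steps):
        probs = F.softmax(policy(), dim=-1)
        # Sample k distinct examples according to the current policy.
        selected = torch.multinomial(probs, K_SHOTS, replacement=False).tolist()
        replay.append((selected, attack_reward(selected)))

        # Simplified replay: past selections are re-scored under the
        # current policy and used in a mini-batch update.
        batch = random.sample(list(replay), min(BATCH_SIZE, len(replay)))
        log_probs = F.log_softmax(policy(), dim=-1)
        policy_loss = torch.stack([
            -reward * log_probs[ids].sum() for ids, reward in batch
        ]).mean()
        # Entropy bonus discourages collapsing onto one selection
        # (i.e., getting stuck in a local optimum).
        entropy = -(log_probs.exp() * log_probs).sum()
        loss = policy_loss - ENTROPY_COEF * entropy

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return policy
```

The replay buffer lets each environment interaction (one attack attempt) contribute to several updates, and the entropy term keeps the selection distribution from collapsing too early; both are generic RL stabilization techniques consistent with, but not necessarily identical to, the strategies named in the abstract.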
External IDs: dblp:conf/ispa/TangY0DHH24