Feedback-Driven Black-Box Safety Alignment Testing of Large Language Models via Reinforcement Learning

Published: 16 Jun 2026, Last Modified: 16 Jun 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: Large language models (LLMs) are equipped with safety alignment mechanisms to reduce harmful outputs, but systematically evaluating the effectiveness of these safeguards remains challenging. Existing methods rely mainly on manually curated prompts or stochastic mutation-based search, both of which explore the prompt space inefficiently. We propose SEAT-RL, a feedback-driven black-box framework that uses deep reinforcement learning (DRL) to generate adversarial prompts against safety-aligned LLMs. We formulate prompt generation as a sequential decision-making problem in which an agent iteratively refines prompts based on feedback from the target model. To improve effectiveness and efficiency, we design (1) an LLM-facilitated action space that enables diverse yet constrained prompt transformations, and (2) a dense, automated reward function that guides exploration toward safety violations. The learned policy is reusable and transfers across target models without retraining. Experiments on six representative LLMs show that, under the same query budget, SEAT-RL discovers substantially more safety failures than existing automated baselines, such as stochastic search methods based on genetic algorithms. SEAT-RL also exhibits stronger stability, cross-model transferability, and robustness against multiple defense mechanisms. Ablation studies further validate the key design choices. These results suggest that RL provides an effective framework for black-box red-teaming evaluation of LLM safety alignment.
Certifications: J2C Certification
Submission Type: Regular submission (no more than 12 pages of main content)
Code: https://github.com/XuanChen-xc/SEAT-RL
Supplementary Material: zip
Assigned Action Editor: ~Hongyang_Zhang1
Submission Number: 8169