A Collaborative Multi-Agent Framework for Jailbreaking with RL-Based Dynamic Prompting

AAAI 2026 Workshop TrustAgent Submission15 Authors

Published: 20 Nov 2025, Last Modified: 09 Mar 2026. Venue: AAAI 2026 TrustAgent Workshop Poster. License: CC BY 4.0

Keywords: Trust Agents, LLM Evaluation Framework, LLM Safety

TL;DR: We propose a dense-reward RL framework coupling a Soft Actor–Critic attacker, rewriter, and judge to evaluate and improve the trustworthiness of agentic LLMs through interpretable, query-efficient jailbreak analysis.

Abstract: Evaluating large language models (LLMs) under adversarial prompting remains difficult: heuristic and genetic red-teaming pipelines require heavy hand-tuning and are query-inefficient, while recent reinforcement-learning-based attacks often optimize sparse or binary rewards (e.g., pass/fail, cosine similarity), yielding unstable training and low diversity. In this paper, we present a rating-based adversarial RL framework that formulates jailbreak discovery as dense-reward optimization. The framework closes the loop among (i) a Soft Actor-Critic (SAC) attacker agent with hybrid discrete-continuous actions (operator family + style sliders), (ii) a controllable rewriter LLM that preserves intent while injecting surface diversity and stealth, and (iii) a calibrated judge LLM that assigns five absolute ratings: Success, Stealth, Novelty, Efficiency, and Impact. A curriculum-weighted aggregation converts these ratings into a continuous reward, and stratified replay, early exit, and de-duplication further improve sample and query efficiency. We instantiate the judge as LLaMA-3-Instruct and the rewriter as Yi-9B, and evaluate across three target models (LLaMA-3, Qwen-2.5-7B, Mistral-7B), using the SORRY-Bench dataset for training seeds/priors and the out-of-distribution harmful subset of the JailbreakBench dataset for testing. Empirically, our framework achieves up to a 15% higher attack success rate while maintaining greater prompt novelty and stealth. These results indicate that dense, interpretable rating signals paired with off-policy optimization provide a scalable foundation for safety-aligned, query-efficient jailbreak evaluation.
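
The curriculum-weighted aggregation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the rating scale, the weights, or the schedule, so everything concrete here is an assumption made for illustration: a 1-to-5 rating scale, a linear schedule that moves from success-heavy weights to uniform weights, and the hypothetical names `curriculum_weights` and `aggregate_reward`. It is not the authors' implementation.

```python
RATING_NAMES = ("success", "stealth", "novelty", "efficiency", "impact")

# Hypothetical start/end weights; each set sums to 1 so the reward stays in [0, 1].
START_WEIGHTS = {"success": 0.6, "stealth": 0.1, "novelty": 0.1,
                 "efficiency": 0.1, "impact": 0.1}
END_WEIGHTS = {name: 0.2 for name in RATING_NAMES}


def curriculum_weights(progress: float) -> dict:
    """Linearly interpolate weights as training progress goes from 0 to 1 (assumed schedule)."""
    p = min(max(progress, 0.0), 1.0)  # clamp progress to [0, 1]
    return {n: (1.0 - p) * START_WEIGHTS[n] + p * END_WEIGHTS[n] for n in RATING_NAMES}


def aggregate_reward(ratings: dict, progress: float, lo: float = 1.0, hi: float = 5.0) -> float:
    """Convert five judge ratings into one continuous reward in [0, 1].

    The [lo, hi] rating scale is an assumption; the abstract only says the
    judge assigns five absolute ratings.
    """
    weights = curriculum_weights(progress)
    reward = 0.0
    for name in RATING_NAMES:
        rating = ratings[name]
        if not lo <= rating <= hi:
            raise ValueError(f"rating {name}={rating} outside [{lo}, {hi}]")
        # Normalize each rating to [0, 1] before weighting.
        reward += weights[name] * (rating - lo) / (hi - lo)
    return reward


if __name__ == "__main__":
    example = {"success": 5, "stealth": 3, "novelty": 2, "efficiency": 4, "impact": 3}
    print(aggregate_reward(example, progress=0.0))  # early training: success dominates
    print(aggregate_reward(example, progress=1.0))  # late training: all ratings weigh equally
```

Under these assumptions, a prompt that succeeds but scores poorly on the other dimensions earns a high reward early in training and a lower one later, which is one way a curriculum could shift emphasis toward stealth and novelty once success becomes common.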

Submission Number: 15
