Keywords: Trust Agents, LLM Evaluation Framework, LLM Safety
TL;DR: We propose a dense-reward RL framework coupling a Soft Actor–Critic attacker, rewriter, and judge to evaluate and improve the trustworthiness of agentic LLMs through interpretable, query-efficient jailbreak analysis.
Abstract: Evaluating large language models (LLMs) under adversarial prompting remains difficult: heuristic and genetic red-teaming pipelines require heavy hand-tuning and are query-inefficient, while recent reinforcement-learning-based attacks often optimize sparse or binary rewards (e.g., pass/fail, cosine similarity), yielding unstable training and low diversity. In this paper, we present a rating-based adversarial RL framework that formulates jailbreak discovery as dense-reward optimization. The framework closes the loop among (i) a Soft Actor–Critic (SAC) attacker agent with hybrid discrete–continuous actions (operator family + style sliders), (ii) a controllable rewriter LLM that preserves intent while injecting surface diversity and stealth, and (iii) a calibrated judge LLM that assigns five absolute ratings: Success, Stealth, Novelty, Efficiency, and Impact. A curriculum-weighted aggregation converts these ratings into a continuous reward, while stratified replay, early exit, and de-duplication further improve sample and query efficiency. We instantiate the judge as LLaMA-3-Instruct and the rewriter as Yi-9B, and evaluate across three target models (LLaMA-3, Qwen-2.5-7B, Mistral-7B), using the SORRY-Bench dataset for training seeds/priors and the out-of-distribution harmful subset of the JailbreakBench dataset for testing. Empirically, our framework achieves up to a 15% higher attack success rate while maintaining greater prompt novelty and stealth. These results indicate that dense, interpretable rating signals paired with off-policy optimization provide a scalable foundation for safety-aligned, query-efficient jailbreak evaluation.
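The curriculum-weighted aggregation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the rating scale, the weights, or the schedule, so everything below is an assumption made for illustration: a 1 to 5 rating scale, a linear interpolation between hypothetical "early" and "late" weight sets, and the function names `curriculum_weights` and `aggregate_reward`. It is not the authors' implementation.

```python
from typing import Dict, Mapping, Tuple

# The five rating dimensions named in the abstract.
RATING_NAMES = ("success", "stealth", "novelty", "efficiency", "impact")

# Hypothetical weight sets (assumed, not from the paper). Each sums to 1.
EARLY_WEIGHTS = {"success": 0.6, "stealth": 0.1, "novelty": 0.1,
                 "efficiency": 0.1, "impact": 0.1}
LATE_WEIGHTS = {"success": 0.3, "stealth": 0.2, "novelty": 0.2,
                "efficiency": 0.15, "impact": 0.15}


def curriculum_weights(progress: float) -> Dict[str, float]:
    """Linearly interpolate from EARLY_WEIGHTS to LATE_WEIGHTS.

    `progress` is the fraction of training completed, clamped to [0, 1].
    The linear schedule is an assumption chosen for simplicity.
    """
    p = min(max(progress, 0.0), 1.0)
    return {k: (1.0 - p) * EARLY_WEIGHTS[k] + p * LATE_WEIGHTS[k]
            for k in RATING_NAMES}


def aggregate_reward(ratings: Mapping[str, float],
                     progress: float,
                     scale: Tuple[float, float] = (1.0, 5.0)) -> float:
    """Convert five judge ratings into one continuous reward in [0, 1].

    Each rating is min-max normalized over `scale` (an assumed 1-5 range),
    then combined with the curriculum weights for the current progress.
    """
    lo, hi = scale
    weights = curriculum_weights(progress)
    total = 0.0
    for name in RATING_NAMES:
        r = ratings[name]
        if not lo <= r <= hi:
            raise ValueError(f"rating {name}={r} outside scale {scale}")
        total += weights[name] * (r - lo) / (hi - lo)
    return total


if __name__ == "__main__":
    example = {"success": 4, "stealth": 3, "novelty": 5,
               "efficiency": 2, "impact": 3}
    for step in (0.0, 0.5, 1.0):
        print(step, round(aggregate_reward(example, step), 3))
```

Because the weights always sum to 1 and each rating is normalized, the reward stays in [0, 1] at every stage of the curriculum, which keeps it dense and comparable across training phases under these assumptions.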
Submission Number: 15