RLShield: Dynamic Jailbreak Detection for LLMs via Reinforced Adaptive Learning

ACL ARR 2026 January Submission9813 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Jailbreak Detection; Adaptive Threshold; Reinforcement Learning
Abstract: While prompt engineering enhances the capabilities of Large Language Models (LLMs), it also raises critical safety concerns. Due to their "black-box" nature, LLMs are vulnerable to jailbreak prompts: adversarial inputs designed to bypass safeguards and induce the generation of harmful content. Existing detection mechanisms rely on static model components or fixed decision thresholds, limiting their ability to generalize to evolving attack patterns and continual model updates. To bridge this gap, we propose RLShield, a dynamic jailbreak detection framework that employs reinforcement learning for adaptive threshold selection. RLShield incorporates three key innovations: first, a dynamic retrieval and LLM-based rewriting module to simulate diverse adversarial contexts; second, a cross-layer representation analysis to pinpoint safety-critical parameters; and third, a Soft Actor-Critic (SAC)-based agent that learns to predict optimal, sample-specific detection thresholds. Experimental results demonstrate that RLShield consistently outperforms state-of-the-art baselines in detection accuracy while maintaining high computational efficiency. Notably, it improves F1 by up to 7.3\% and achieves an average 3$\times$ gain in inference efficiency across multiple LLM backbones.
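
To make the adaptive-threshold idea concrete, the sketch below shows how a learned policy could map a prompt's representation features to a sample-specific detection threshold and apply it to a safety score. This is a minimal illustration, not the authors' implementation: all names (ThresholdActor, detect, the 64-dimensional feature input) are hypothetical, and a full SAC agent would additionally include twin Q-critics, an entropy-regularized objective, and a replay buffer, which are omitted here.

# Minimal sketch (hypothetical, not the paper's code): an actor network that
# maps per-sample features to a detection threshold in (0, 1). A complete SAC
# setup would add twin critics, entropy regularization, and a replay buffer.
import torch
import torch.nn as nn

class ThresholdActor(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # mean and log-std of a Gaussian action
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        mu, log_std = self.net(feats).chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()
        action = mu + std * torch.randn_like(std)  # reparameterized sample
        return torch.sigmoid(action)               # squash to a (0, 1) threshold

def detect(safety_score: torch.Tensor, feats: torch.Tensor,
           actor: ThresholdActor) -> torch.Tensor:
    # Flag a prompt as a jailbreak when its safety score exceeds the
    # sample-specific threshold predicted by the actor.
    threshold = actor(feats)
    return safety_score > threshold.squeeze(-1)

# Toy usage: 4 prompts, each with a feature vector (e.g. derived from
# cross-layer representations) and a scalar safety score in [0, 1].
actor = ThresholdActor()
feats = torch.randn(4, 64)
scores = torch.tensor([0.10, 0.45, 0.80, 0.95])
print(detect(scores, feats, actor))  # e.g. tensor([False, False, True, True])

The design choice being illustrated is that the threshold is an action conditioned on the individual sample rather than a fixed global constant, which is what allows the detector to adapt per prompt.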
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Safety and Alignment, NLP Applications, Language Modeling
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 9813