Keywords: NLP, Optimization, Generative Models, LLM, Rudeness Detection, Reinforcement Learning, Prompt Optimization, AI Safety, In-Context Unlearning, AI Alignment
Abstract: LLMs can produce harmful content in interactive dialogue, but retraining or weight editing is costly and often infeasible for black-box systems. This paper presents DynaSafe-RL, an adaptive behaviour-steering framework that regulates LLMs at runtime without retraining or parameter access. It operates through an adaptive, closed-loop prompt refinement mechanism driven by a reinforcement learning agent that selects optimal safeguarding strategies based on live safety–quality feedback. DynaSafe-RL significantly improves safety across 12 harm categories and four diverse LLMs, maintaining response quality and outperforming most handcrafted dynamic baselines\footnote{Full code and results are available at \url{https://anonymous.4open.science/r/DynaSafe-RL-126C/}}. Retention analysis shows that 60.12\%–94.50\% of the improved behaviour persists even after safeguards are removed. The framework is model-agnostic, lightweight, and suitable for practical deployment.
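The abstract describes a closed loop in which an RL agent selects among safeguarding strategies using live safety–quality feedback. A minimal sketch of that idea, assuming an epsilon-greedy bandit over hypothetical strategy names and an illustrative reward weighting (none of these details are taken from the paper):

```python
import random

# Hypothetical sketch of the closed-loop selection in DynaSafe-RL:
# an epsilon-greedy bandit picks a safeguarding prompt strategy and
# updates its value estimate from a combined safety-quality reward.
# Strategy names and the reward weighting are illustrative assumptions.

STRATEGIES = ["none", "gentle_reminder", "explicit_policy", "persona_anchor"]

class SafeguardSelector:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {s: 0 for s in STRATEGIES}
        self.values = {s: 0.0 for s in STRATEGIES}

    def select(self):
        # Explore a random strategy with probability epsilon, else exploit.
        if random.random() < self.epsilon:
            return random.choice(STRATEGIES)
        return max(self.values, key=self.values.get)

    def update(self, strategy, safety, quality, alpha=0.7):
        # Combined reward trades off safety against response quality.
        reward = alpha * safety + (1 - alpha) * quality
        self.counts[strategy] += 1
        n = self.counts[strategy]
        # Incremental mean update of the strategy's value estimate.
        self.values[strategy] += (reward - self.values[strategy]) / n

selector = SafeguardSelector()
chosen = selector.select()
selector.update(chosen, safety=0.9, quality=0.8)
```

In deployment, `safety` and `quality` would come from live scoring of each model response; here they are fixed numbers purely for illustration.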
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: hate-speech detection, human-computer interaction, sociolinguistics, bias/toxicity
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8839