Keywords: NLP, Optimization, Generative Models, LLM, Rudeness Detection, Reinforcement Learning, Prompt Optimization, AI Safety, In-Context Unlearning, AI Alignment
Abstract: LLMs can produce harmful content in interactive dialogue, but retraining or weight editing is costly and often infeasible for black-box systems. This paper presents DynaSafe-RL, an adaptive behaviour-steering framework that regulates LLMs at runtime without retraining or parameter access. It operates through an adaptive, closed-loop prompt refinement mechanism driven by a reinforcement learning agent that selects optimal safeguarding strategies based on live safety–quality feedback. DynaSafe-RL significantly improves safety across 12 harm categories and four diverse LLMs, maintaining response quality and outperforming most handcrafted dynamic baselines\footnote{Full code and results are available at \url{https://anonymous.4open.science/r/DynaSafe-RL-126C/}}. Retention analysis shows that 60.12\%–94.50\% of the improved behaviour persists even after safeguards are removed. The framework is model-agnostic, lightweight, and suitable for practical deployment.
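The abstract describes a closed loop in which an RL agent selects among safeguarding strategies using live safety–quality feedback. A minimal sketch of that idea, assuming an epsilon-greedy bandit over hypothetical strategy names and an illustrative reward weighting (none of these details are taken from the paper):

```python
import random

# Hypothetical sketch of the closed-loop selection in DynaSafe-RL:
# an epsilon-greedy bandit picks a safeguarding prompt strategy and
# updates its value estimate from a combined safety-quality reward.
# Strategy names and the reward weighting are illustrative assumptions.

STRATEGIES = ["none", "gentle_reminder", "explicit_policy", "persona_anchor"]

class SafeguardSelector:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {s: 0 for s in STRATEGIES}
        self.values = {s: 0.0 for s in STRATEGIES}

    def select(self):
        # Explore a random strategy with probability epsilon, else exploit.
        if random.random() < self.epsilon:
            return random.choice(STRATEGIES)
        return max(self.values, key=self.values.get)

    def update(self, strategy, safety, quality, alpha=0.7):
        # Combined reward trades off safety against response quality.
        reward = alpha * safety + (1 - alpha) * quality
        self.counts[strategy] += 1
        n = self.counts[strategy]
        # Incremental mean update of the strategy's value estimate.
        self.values[strategy] += (reward - self.values[strategy]) / n

selector = SafeguardSelector()
chosen = selector.select()
selector.update(chosen, safety=0.9, quality=0.8)
```

In deployment, `safety` and `quality` would come from live scoring of each model response; here they are fixed numbers purely for illustration.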
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: hate-speech detection, human-computer interaction, sociolinguistics, bias/toxicity
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8839