LLMSafeGuard: A Training-Free Framework for Safeguarding LLM Decoding via Context-Wise Similarity Validation
Keywords: LLM safeguard, Decoding, Copyright, Detoxification, Jailbreak defense
Abstract: Large Language Models (LLMs) have advanced NLP but also introduce ethical and societal risks by generating harmful content. Existing mitigation methods often require training separate control models or proactively intervening during decoding, which can degrade quality and increase computational cost. To address these limitations, we propose LLMSafeGuard, a lightweight real-time framework that integrates an external validator into decoding to reject unsafe outputs while preserving valid ones. It uses a similarity-based validation method that removes the need for control-model training and a context-wise timing strategy that intervenes only when necessary.
We evaluate LLMSafeGuard on detoxification, copyright safeguarding, and jailbreak defense tasks across six models. LLMSafeGuard outperforms SOTA baselines across all tasks; for example, it reduces toxic output by at least 38.6% while preserving linguistic quality. Our context-wise timing strategy achieves a 1.7× speedup over per-step validation without sacrificing effectiveness.
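The abstract describes similarity-based validation of candidate continuations during decoding but gives no implementation details. A minimal, purely illustrative sketch of the idea (not the authors' method) might compare each candidate against known-unsafe demonstration examples using a toy bag-of-words embedding; the `SimilarityValidator` class, the embedding, and the threshold below are all hypothetical assumptions:

```python
from collections import Counter
import math

def embed(text):
    # Toy embedding: lowercase bag-of-words counts (assumption, for illustration only).
    return Counter(text.lower().split())

def cosine_sim(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SimilarityValidator:
    """Reject a candidate continuation when it is too similar to any
    known-unsafe demonstration example; no control-model training needed."""

    def __init__(self, unsafe_examples, threshold=0.5):
        self.unsafe = [embed(e) for e in unsafe_examples]
        self.threshold = threshold

    def is_safe(self, candidate):
        c = embed(candidate)
        return all(cosine_sim(c, u) < self.threshold for u in self.unsafe)

validator = SimilarityValidator(["you are a terrible worthless person"])
print(validator.is_safe("the weather is lovely today"))        # prints True
print(validator.is_safe("you are a terrible worthless liar"))  # prints False
```

In a real decoding loop the validator would only be invoked at context-wise checkpoints rather than at every step, which is the source of the reported speedup over per-step validation.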
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Computational Social Science and Cultural Analytics, Ethics, Bias, and Fairness, Generation, Language Modeling
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 451