Abstract: Large language models (LLMs) have achieved remarkable performance across diverse tasks, yet ensuring output safety remains a fundamental challenge. Existing defense methods often suffer from limited generalization, high computational overhead, or significant utility degradation. In this work, we present SecDecoding, a lightweight decoding-time defense framework that significantly improves output safety without compromising model helpfulness. SecDecoding leverages a pair of small contrastive models, a base model and a safety-fine-tuned expert, to estimate token-level safety signals by measuring the divergence in their output distributions. These signals dynamically steer the target model's generation toward safer trajectories, effectively suppressing unsafe content. Experimental results show that SecDecoding achieves near-zero attack success rates against a wide spectrum of advanced jailbreak attacks across multiple LLMs, while maintaining the model's helpfulness with minimal degradation. Additionally, SecDecoding is a modular and resource-efficient approach that requires only an auxiliary 1-billion-parameter model and is compatible with speculative decoding, offering up to a 1.5× inference speedup.
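The abstract describes the mechanism only at a high level. Below is a minimal sketch of what such contrastive, decoding-time safety steering could look like, assuming a greedy decoding loop and a fixed steering weight; the function names (e.g., `generate_safely`, `next_token_logits`) and the hyperparameter `alpha` are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch only: contrastive safety steering at decode time, assuming the
# token-level safety signal is the log-probability gap between a small
# safety-fine-tuned expert and its base model (an assumption based on the
# abstract, not the paper's exact formulation).
import torch
import torch.nn.functional as F


def next_token_logits(model, input_ids):
    """Return the logits for the next token given the current prefix."""
    with torch.no_grad():
        return model(input_ids).logits[:, -1, :]


@torch.no_grad()
def generate_safely(target, expert, base, tokenizer, prompt,
                    alpha=2.0, max_new_tokens=64):
    """Greedily decode from `target`, nudging each step toward tokens the
    safety expert prefers over the base model."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        target_logits = next_token_logits(target, input_ids)
        expert_logp = F.log_softmax(next_token_logits(expert, input_ids), dim=-1)
        base_logp = F.log_softmax(next_token_logits(base, input_ids), dim=-1)

        # Divergence between the safety-tuned expert and its base model
        # serves as a token-level safety signal.
        safety_signal = expert_logp - base_logp

        # Steer the target distribution toward safer tokens.
        steered_logits = target_logits + alpha * safety_signal
        next_id = steered_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

In this sketch, `target` would be the large model being defended and `expert`/`base` a pair of small (e.g., 1B-parameter) auxiliary models; how the steering strength is scheduled per token, and how this composes with speculative decoding, is left to the paper itself.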
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4068