Abstract: Large language models (LLMs) have achieved remarkable performance across diverse tasks, yet ensuring output safety remains a fundamental challenge. Existing defense methods often suffer from limited generalization, high computational overhead, or significant utility degradation. In this work, we present SecDecoding, a lightweight decoding-time defense framework that significantly improves output safety without compromising model helpfulness. SecDecoding leverages a pair of small contrastive models, a base model and a safety-fine-tuned expert, to estimate token-level safety signals by measuring the divergence in their output distributions. These signals dynamically steer the target model's generation toward safer trajectories, effectively suppressing unsafe content. Experimental results show that SecDecoding achieves near-zero attack success rates against a wide spectrum of advanced jailbreak attacks across multiple LLMs, while maintaining the model's helpfulness with minimal degradation. Additionally, SecDecoding is a modular and resource-efficient approach that requires only an auxiliary 1-billion-parameter model and is compatible with speculative decoding, offering up to a 1.5× inference speedup.
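The abstract describes the mechanism only at a high level. Below is a minimal sketch of what such contrastive, decoding-time safety steering could look like, assuming a greedy decoding loop and a fixed steering weight; the function names (e.g., `generate_safely`, `next_token_logits`) and the hyperparameter `alpha` are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch only: contrastive safety steering at decode time, assuming the
# token-level safety signal is the log-probability gap between a small
# safety-fine-tuned expert and its base model (an assumption based on the
# abstract, not the paper's exact formulation).
import torch
import torch.nn.functional as F


def next_token_logits(model, input_ids):
    """Return the logits for the next token given the current prefix."""
    with torch.no_grad():
        return model(input_ids).logits[:, -1, :]


@torch.no_grad()
def generate_safely(target, expert, base, tokenizer, prompt,
                    alpha=2.0, max_new_tokens=64):
    """Greedily decode from `target`, nudging each step toward tokens the
    safety expert prefers over the base model."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        target_logits = next_token_logits(target, input_ids)
        expert_logp = F.log_softmax(next_token_logits(expert, input_ids), dim=-1)
        base_logp = F.log_softmax(next_token_logits(base, input_ids), dim=-1)

        # Divergence between the safety-tuned expert and its base model
        # serves as a token-level safety signal.
        safety_signal = expert_logp - base_logp

        # Steer the target distribution toward safer tokens.
        steered_logits = target_logits + alpha * safety_signal
        next_id = steered_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

In this sketch, `target` would be the large model being defended and `expert`/`base` a pair of small (e.g., 1B-parameter) auxiliary models; how the steering strength is scheduled per token, and how this composes with speculative decoding, is left to the paper itself.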
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4068