Decoupled Alignment for Robust Plug-and-Play Adaptation

02 Sept 2025 (modified: 03 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Jailbreak Aligner, Memory Editing
Abstract: We introduce a training-free safety enhancement method for aligning large language models (LLMs) without the need for supervised fine-tuning or reinforcement learning from human feedback. Our main idea is to provide a robust plug-and-play approach that prevents shadow alignment when models are adapted to downstream tasks. Specifically, we exploit knowledge distillation to extract alignment information from well-aligned LLMs and integrate it, in a plug-and-play manner, into LLMs affected by shadow alignment. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method significantly enhances the average defense success rate by approximately 14.41\%, reaching as high as 51.39\%, across 17 affected LLMs, without compromising performance.
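The abstract's use of delta debugging to isolate critical knowledge components can be illustrated with the classic ddmin algorithm. This is a minimal sketch, not the paper's implementation: the component list and the `is_critical` predicate (standing in for "distilling this subset still restores alignment") are hypothetical placeholders.

```python
def ddmin(items, is_critical):
    """Classic delta debugging (ddmin): shrink `items` to a small subset
    for which the predicate `is_critical` still holds.

    `is_critical(subset)` -> True if the subset still exhibits the target
    property (here, hypothetically: the distilled knowledge subset still
    restores alignment in the shadow-aligned model)."""
    n = 2  # number of chunks to split into
    while len(items) >= 2:
        chunk_size = max(1, len(items) // n)
        chunks = [items[i:i + chunk_size]
                  for i in range(0, len(items), chunk_size)]
        reduced = False
        # Try each chunk on its own.
        for chunk in chunks:
            if is_critical(chunk):
                items, n, reduced = chunk, 2, True
                break
        if not reduced:
            # Try the complement of each chunk.
            for i in range(len(chunks)):
                complement = [x for j, c in enumerate(chunks)
                              if j != i for x in c]
                if complement and is_critical(complement):
                    items = complement
                    n = max(n - 1, 2)
                    reduced = True
                    break
        if not reduced:
            # Refine the partition, or stop at single-element granularity.
            if n >= len(items):
                break
            n = min(len(items), n * 2)
    return items


# Toy usage: out of 8 candidate components, only {3, 5} together matter.
minimal = ddmin(list(range(8)), lambda s: {3, 5} <= set(s))
# -> [3, 5]
```

In the toy run above, ddmin repeatedly discards chunks whose removal leaves the predicate satisfied, converging on the two truly critical components.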
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1103