Keywords: LLMs, PEFT, Safety Alignment, Spectral Analysis, Model Pruning
TLDR: We propose a data-free, post-hoc pruning method that detects and removes unsafe LoRA layers using spectral deviation analysis, improving the safety and deployability of adapted LLMs without retraining.
Abstract: Large Language Models (LLMs) adapted through Low-Rank Adaptation (LoRA) often exhibit weakened safety alignment, even when fine-tuned on benign datasets. Such degradation poses significant risks for deployable AI systems, where parameter updates can unintentionally introduce unsafe or unstable behaviors. In this work, we propose Directional Deviation Index-Guided Pruning (DDI Pruning), a post-hoc, data-free framework for diagnosing and mitigating unsafe LoRA adaptations. DDI quantifies the spectral and directional deviation of each LoRA-updated layer relative to its pretrained baseline, identifying the layers that contribute most to instability or misalignment. Layers with high DDI scores are selectively pruned, improving both model robustness and computational efficiency without additional training or supervision. We evaluate the proposed approach on multiple language generation and agent planning benchmarks using several LLM backbones. Results show that DDI Pruning consistently reduces harmful or adversarial behaviors while preserving task accuracy and coherence. Ablation studies further demonstrate that each component of DDI contributes to capturing unsafe adaptation patterns, highlighting its interpretability and generality across domains. Overall, DDI Pruning provides an effective and practical mechanism for enhancing the safety alignment of adapted LLMs and contributes to the development of reliable and deployable AI systems.
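The abstract does not give the exact DDI formula, so the following minimal Python sketch is only an illustration of the general idea: scoring each LoRA-updated layer by how much its update deviates, spectrally and directionally, from the pretrained weight, and pruning high-scoring updates without any data or retraining. The function names (`ddi_score`, `prune_unsafe_lora`), the top-k subspace choice, and the product of the two terms are assumptions for the sake of the example, not the authors' implementation.

```python
# Illustrative sketch only: the paper's exact DDI formulation is not specified
# in the abstract; this is one plausible stand-in, not the authors' method.
import torch


def ddi_score(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> float:
    """Score a LoRA-updated layer by the spectral and directional deviation of
    its update dW = B @ A from the pretrained weight W0 (hypothetical form)."""
    dW = B @ A  # LoRA update, same shape as W0: (d_out, d_in)

    # Spectral deviation: magnitude of the update relative to the baseline spectrum.
    s_dw = torch.linalg.svdvals(dW)
    s_w0 = torch.linalg.svdvals(W0)
    spectral = (s_dw[0] / (s_w0[0] + 1e-8)).item()

    # Directional deviation: misalignment between the update's top left singular
    # direction and the subspace spanned by the baseline's top-k directions.
    k = min(8, min(W0.shape))
    U_dw, _, _ = torch.linalg.svd(dW, full_matrices=False)
    U_w0, _, _ = torch.linalg.svd(W0, full_matrices=False)
    proj = U_w0[:, :k].T @ U_dw[:, 0]
    directional = 1.0 - proj.norm().item()  # 0 = aligned, 1 = orthogonal

    return spectral * directional


def prune_unsafe_lora(lora_layers: dict, base_weights: dict, threshold: float) -> None:
    """Zero out (prune) LoRA updates whose DDI exceeds a threshold.

    `lora_layers` maps layer names to (A, B) factor tensors; `base_weights`
    maps the same names to pretrained weights. The procedure is data-free:
    no forward passes or calibration data are required."""
    for name, (A, B) in lora_layers.items():
        score = ddi_score(base_weights[name], A, B)
        if score > threshold:
            A.zero_()
            B.zero_()
            print(f"pruned {name}: DDI={score:.3f}")
```

Under this reading, the threshold (or an equivalent top-fraction rule) is the only hyperparameter, and pruning a layer simply removes its adaptation, reverting that layer to the pretrained baseline.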
Submission Number: 57