Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
Keywords: Responsible AI, AI Safety, Safety Guardrails, Model Alignment, Hallucination Detection, Inference Efficiency, Model Adapters, Parameter-Efficient Fine-tuning, Factorized Representations
TL;DR: We introduce Disentangled Safety Adapters (DSA), lightweight modules that leverage a base model's representations to significantly improve AI safety and alignment with minimal computational overhead.
Abstract: Existing paradigms for ensuring AI safety, such as guardrail models and alignment training, often compromise either inference efficiency or development flexibility. We introduce Disentangled Safety Adapters (DSA), a novel framework that addresses these challenges by decoupling safety-specific computations from a task-optimized base model. DSA utilizes lightweight adapters that leverage the base model's internal representations, enabling diverse and flexible safety functionalities with minimal impact on inference cost. Empirically, DSA-based safety guardrails substantially outperform comparably sized standalone models across hate speech classification, detection of unsafe model inputs and responses, and hallucination detection, with relative improvements of up to 53% in AUC. Furthermore, DSA-based safety alignment allows dynamic, inference-time adjustment of alignment strength and a fine-grained trade-off between instruction-following performance and model safety. Importantly, combining the DSA safety guardrail with DSA safety alignment enables context-dependent alignment strength, boosting safety on StrongReject by 93% while maintaining 98% of performance on MTBench, a total reduction in alignment tax of 8 percentage points compared to standard safety alignment fine-tuning. Overall, DSA presents a promising path towards more modular, efficient, and adaptable AI safety and alignment.
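To make the mechanism concrete, the sketch below illustrates the two components the abstract describes: a guardrail adapter that scores safety from the frozen base model's hidden states, and an inference-time blend of base and alignment-adapter logits with a strength that can depend on the guardrail's output. This is a minimal illustration of the idea, not the paper's implementation; all names, shapes, and thresholds (SafetyAdapter, blend_logits, d_model, alpha, the 0.5 cutoff) are assumptions.

```python
import torch
import torch.nn as nn


class SafetyAdapter(nn.Module):
    """Hypothetical guardrail head over frozen base-model hidden states.

    The base model is untouched; at inference, guardrail scoring adds only
    this module's small forward cost on top of the base model's own pass.
    """
    def __init__(self, d_model: int, n_classes: int = 2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model // 4),
            nn.GELU(),
            nn.Linear(d_model // 4, n_classes),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pooled = hidden_states.mean(dim=1)  # (batch, seq, d) -> (batch, d)
        return self.head(pooled)            # (batch, n_classes)


def blend_logits(base_logits: torch.Tensor,
                 aligned_logits: torch.Tensor,
                 alpha: torch.Tensor) -> torch.Tensor:
    """Inference-time alignment: alpha=0 recovers the base model, alpha=1
    follows the safety-aligned adapter, and values in between trade off
    instruction following against safety."""
    return base_logits + alpha * (aligned_logits - base_logits)


# Toy usage with random tensors standing in for a real model's outputs.
batch, seq, d_model, vocab = 2, 16, 768, 32000
hidden = torch.randn(batch, seq, d_model)       # base-model hidden states
base_logits = torch.randn(batch, vocab)         # base next-token logits
aligned = torch.randn(batch, vocab)             # alignment-adapter logits

guard = SafetyAdapter(d_model)
p_unsafe = guard(hidden).softmax(dim=-1)[:, 1]  # guardrail risk per example

# Context-dependent strength: raise alpha when the guardrail flags risk.
alpha = (0.2 + 0.8 * (p_unsafe > 0.5).float()).unsqueeze(-1)  # (batch, 1)
next_token_logits = blend_logits(base_logits, aligned, alpha)
```

The design point this sketch tries to capture is that the adapters consume representations the base model already computes, so both guardrail scoring and alignment add only the adapters' small cost rather than a second full model pass.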
Submission Number: 71