Keywords: safety alignment; pluralistic alignment
Abstract: Current safety alignment methods for large language models use a rigid, one-size-fits-all approach, blocking any content deemed unsafe by the model provider. This lacks flexibility for varying cultural norms and diverse user safety needs, making these static models too restrictive. We propose _Controllable Safety Alignment_, a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow _safety configs_, natural language descriptions of the desired safety requirements provided as part of the system prompt. To adjust model safety behavior, a user only needs to modify the safety config ahead of inference. We devise a novel controllability evaluation protocol that considers both helpfulness and safety, summarizing them into a single ControlScore. We propose LACUSA, a data-centric method for controllable safety alignment. On our curated test sets, LACUSA leads to substantial gains in controllability over strong baselines. Our framework not only expands the practicality of aligned LLMs but also contributes to models that better represent pluralistic human values.
Submission Number: 31