Keywords: Pluralistic Alignment, Personalised Alignment
Abstract: Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose **ProSocialAlign**, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first applying hard constraints to eliminate harmful continuations, then optimizing for prosocial quality within the safe set. Our method combines (i) *directional regulation*, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) *preference-aware autoregressive reward modeling* trained jointly across attributes with *gradient conflict resolution*, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and improving alignment with human values across multiple evaluation metrics. **ProSocialAlign** offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time. To facilitate reproducibility, we will publicly release the full source code and dataset upon acceptance.
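A minimal sketch of the two inference-time ingredients described above, assuming a PyTorch model and a parameter-space harm vector; the function names (`apply_directional_regulation`, `lexicographic_select`), the scaling coefficient `alpha`, and the candidate-level selection step are illustrative assumptions, not the authors' released implementation:

```python
import torch


def apply_directional_regulation(model, harm_vector, alpha=1.0):
    """Directional regulation (illustrative): subtract a scaled harm direction
    from the base model's parameters at inference time.

    `harm_vector` is assumed to map parameter names to tensors shaped like the
    corresponding parameters (e.g., a harm-finetuned checkpoint minus the base).
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in harm_vector:
                param.sub_(alpha * harm_vector[name].to(param.device, param.dtype))
    return model


def lexicographic_select(candidates, is_safe, prosocial_score,
                         refusal="I'm sorry, but I can't help with that."):
    """Lexicographic constrained generation (illustrative): hard-filter unsafe
    candidate continuations first, then maximize prosocial quality on the safe set.
    """
    safe = [c for c in candidates if is_safe(c)]
    if not safe:
        return refusal  # hard constraint: never emit an unsafe continuation
    return max(safe, key=prosocial_score)
```

In this reading, the harm-vector subtraction is applied (and can be reverted) at load time without retraining the base model, and the lexicographic step makes safety strictly dominate the prosocial reward rather than trading the two off.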
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16749