Latent Personality Alignment: Improving harmlessness without mentioning harms

ICLR 2026 Conference Submission 13924 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: AI Safety, Personality Traits, Latent Adversarial Training, Adversarial Training, Safety Finetuning
Abstract: Current safety training methods for large language models rely on extensive datasets of harmful prompts paired with refusal responses, requiring many thousands of examples to achieve robustness against adversarial attacks. However, these approaches suffer from poor generalization to novel attack vectors and require substantial computational resources. We propose Latent Personality Alignment (LPA), a data-efficient alternative that trains models to embody beneficial personality traits rather than memorizing specific refusal patterns. Using fewer than 100 abstract personality statements, LPA guides models toward positive traits through latent adversarial training. Our approach achieves comparable safety performance to methods trained on hundreds of thousands of harmful examples while maintaining superior utility on benign tasks. These results suggest that personality-based alignment offers a more principled and scalable approach to harmlessness training than current methods.
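The abstract describes the training recipe only at a high level: the model is optimized to express beneficial traits on a small set of personality statements while an adversary perturbs its hidden activations. The following is a minimal, hypothetical sketch of such a latent-adversarial-training loop in PyTorch. The model choice, perturbed layer, perturbation bound, step sizes, and example statements are all illustrative assumptions, not details taken from the paper.

```python
# Sketch of latent adversarial training on personality statements,
# assuming a HuggingFace-style causal LM. All constants below are
# illustrative assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper does not specify one here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Fewer than 100 abstract personality statements (hypothetical examples).
PERSONALITY_STATEMENTS = [
    "I value honesty and the wellbeing of the people I talk to.",
    "I stay calm and helpful even when a request is confusing.",
]

PERTURB_LAYER = 6   # which residual stream the adversary attacks (assumption)
EPSILON = 0.1       # L-infinity bound on the latent perturbation (assumption)
INNER_STEPS = 4     # adversary's gradient-ascent steps (assumption)

def lm_loss(batch, delta=None):
    """Language-modeling loss, optionally adding a perturbation `delta`
    to the hidden states at PERTURB_LAYER via a forward hook."""
    handle = None
    if delta is not None:
        layer = model.transformer.h[PERTURB_LAYER]
        handle = layer.register_forward_hook(
            lambda mod, inp, out: (out[0] + delta,) + out[1:]
        )
    out = model(**batch, labels=batch["input_ids"])
    if handle is not None:
        handle.remove()
    return out.loss

for statement in PERSONALITY_STATEMENTS:
    batch = tok(statement, return_tensors="pt")
    seq_len = batch["input_ids"].shape[1]
    hidden = model.config.hidden_size
    # Inner loop: the adversary perturbs latents to *maximize* the loss.
    delta = torch.zeros(1, seq_len, hidden, requires_grad=True)
    for _ in range(INNER_STEPS):
        loss = lm_loss(batch, delta)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += 0.02 * grad.sign()      # signed gradient-ascent step
            delta.clamp_(-EPSILON, EPSILON)  # keep the perturbation bounded
    # Outer loop: the model learns to express the trait despite the attack.
    opt.zero_grad()
    lm_loss(batch, delta.detach()).backward()
    opt.step()
```

The key design point this sketch illustrates is that the adversary operates in activation space rather than prompt space, so the training data never needs to mention harmful content: the model is simply made robust at expressing the stated traits under worst-case latent perturbations.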
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13924