Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment

ACL ARR 2025 February Submission6909 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce \textsc{Soteria}, a lightweight yet powerful strategy that locates and minimally adjusts the “functional heads” most responsible for harmful content generation in each language. By altering only a fraction of parameters, \textsc{Soteria} drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that \textsc{Soteria} consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide. We will make the dataset and source code publicly available upon acceptance.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: LLM Safety Alignment, Multilingual Safety Alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English, Chinese, Spanish, German, French, Hindi, Bulgarian, Arabic, Thai, Bengali, Tamil, Telugu
Submission Number: 6909