Targeted Neuron-Level Fine Tuning for Multilingual Toxicity Mitigation in Large Language Models

ACL ARR 2025 May Submission6776 Authors

20 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Mitigating toxicity in large language models (LLMs) is a challenging task, particularly across multiple languages. In this work, we study fine-tuning methods that use a multilingual toxicity-mitigation instruction dataset. To this end, we curate an instruction dataset covering 9 languages: we collect open-source multilingual hate speech datasets and generate non-toxic responses with an open-source LLM. To address the trade-off between general performance and toxicity mitigation, we propose a targeted-neuron fine-tuning method that updates only identified multilingual toxic neurons. Our experiments compare multilingual and English-centric LLMs, revealing that multilingual models benefit more from per-language neuron fine-tuning, achieving better toxicity mitigation. In contrast, full fine-tuning (FFT) tends to yield better toxicity mitigation in English-centric models. However, further analysis shows that FFT can lead to issues such as empty responses or language-inconsistent replies. Compared to FFT, the multilingual targeted-neuron fine-tuning method performs slightly worse at toxicity mitigation but produces more language-consistent responses. Additionally, we conclude that toxic-neuron fine-tuning achieves better general performance than FFT, showing its effectiveness in balancing the trade-off between toxicity mitigation and general performance. Warning: This paper contains toxic and harmful content.
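The core idea of targeted-neuron fine-tuning, updating only parameters associated with identified toxic neurons while freezing the rest, can be illustrated with a minimal sketch. This is not the paper's implementation; the layer shape, learning rate, and the toxic-neuron indices below are hypothetical placeholders, and the paper's actual neuron-identification procedure is not shown here.

```python
import numpy as np

def targeted_neuron_update(W, grad, toxic_neurons, lr=0.1):
    """Apply one gradient step only to the rows (neurons) flagged as toxic.

    W            : (num_neurons, dim) weight matrix of one layer
    grad         : gradient of the loss w.r.t. W, same shape as W
    toxic_neurons: indices of neurons identified as toxic (hypothetical here)
    All other neurons are left frozen, preserving general capabilities.
    """
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[toxic_neurons] = True
    W_new = W.copy()
    W_new[mask] -= lr * grad[mask]  # update only the selected rows
    return W_new

# Toy example: a 4-neuron layer where neurons 1 and 3 are "toxic".
W = np.ones((4, 3))
grad = np.full((4, 3), 0.5)
W_updated = targeted_neuron_update(W, grad, toxic_neurons=[1, 3])
# Rows 0 and 2 stay at 1.0; rows 1 and 3 move to 1.0 - 0.1 * 0.5 = 0.95.
```

In a real fine-tuning run the same effect is typically achieved by masking gradients (or setting `requires_grad` selectively) so the optimizer never touches frozen neurons.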
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingual fine-tuning, multilingual representations, model bias/unfairness mitigation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: Arabic, Chinese, English, French, German, Hindi, Indonesian, Portuguese, Russian
Submission Number: 6776