A Cross-Linguistic Analysis of Detoxifying LLMs with Knowledge Editing

ACL ARR 2024 December Submission 1849 Authors

Published: 16 Dec 2024 (modified: 05 Feb 2025)
License: CC BY 4.0
Abstract: Detoxification has consistently been at the forefront of research on Large Language Models (LLMs), and employing knowledge editing (KE) techniques to purge toxic content from LLMs has attracted much attention, with DINM being a representative example. However, recent studies propose that KE techniques are language-dependent, meaning that editing knowledge in one language may not affect the same knowledge in other languages. If true, this hypothesis presents a major challenge for deploying KE-based detoxification methods like DINM in multilingual contexts. To comprehensively assess the effectiveness of DINM in multilingual scenarios, we first examine its generalizability by erasing toxic knowledge in eight languages other than English. We then test the language-dependency hypothesis by detoxifying LLMs using English data and attacking them in eight other languages. Our findings suggest that the language-dependency hypothesis only partially holds: cross-lingual detoxification is feasible under certain conditions, with its effectiveness varying with the model and the resource richness of the target language.
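To make the evaluation protocol in the abstract concrete, below is a minimal sketch of the cross-lingual measurement loop: edit the model once with English data (e.g., via DINM), then probe it with adversarial prompts in each of the other languages and record a per-language defense success rate. All names here (`defense_success_rate`, `evaluate_cross_lingual`, the dummy stand-ins) are illustrative assumptions, not the authors' code or any specific library API.

```python
# Sketch of the cross-lingual detoxification evaluation described in the
# abstract: the model is edited once (in English, outside this snippet),
# then attacked with prompts in the eight other studied languages.
from typing import Callable, Dict, List

# ISO codes for the attack languages studied in the paper (besides English).
LANGUAGES = ["es", "fr", "bn", "hi", "zh", "th", "ms", "vi"]

def defense_success_rate(
    generate: Callable[[str], str],   # edited model's generation function
    is_toxic: Callable[[str], bool],  # toxicity judge, e.g. a safety classifier
    attacks: List[str],               # adversarial prompts in one language
) -> float:
    """Fraction of adversarial prompts that no longer elicit toxic output."""
    if not attacks:
        return 0.0
    safe = sum(not is_toxic(generate(prompt)) for prompt in attacks)
    return safe / len(attacks)

def evaluate_cross_lingual(
    generate: Callable[[str], str],
    is_toxic: Callable[[str], bool],
    attacks_by_lang: Dict[str, List[str]],
) -> Dict[str, float]:
    """Per-language defense success rates for an already-edited model."""
    return {
        lang: defense_success_rate(generate, is_toxic, attacks_by_lang[lang])
        for lang in LANGUAGES
        if lang in attacks_by_lang
    }

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end; a real study would plug
    # in a DINM-edited model and a multilingual toxicity classifier.
    dummy_generate = lambda prompt: "I cannot help with that request."
    dummy_is_toxic = lambda text: False
    attacks = {lang: [f"[{lang}] adversarial prompt"] for lang in LANGUAGES}
    print(evaluate_cross_lingual(dummy_generate, dummy_is_toxic, attacks))
```

A gap between the English-edited model's defense rate on English attacks and its rates on the other languages is what would support (or weaken) the language-dependency hypothesis.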
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Spanish, French, Bengali, Hindi, Chinese, Thai, Malay, Vietnamese
Submission Number: 1849