Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

ACL ARR 2025 February Submission4207 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and advanced self-processing mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention, thereby limiting scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the intrinsic strengths of LLMs to detect and correct toxic content, and to refine the model itself, without external modules or data annotation. Specifically, we propose a Toxic Signal Detector, an internal self-identification mechanism, coupled with a systematic intervention process that transforms toxic text into a non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset, which is subsequently leveraged to fine-tune the model, enhancing its ability to generate safe and coherent text. Experimental evaluations on benchmark corpora such as DetoxLLM and ParaDetox show that our method achieves state-of-the-art detoxification performance while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.\footnote{Code: \url{https://anonymous.4open.science/r/SRD-6CB4/}} \textit{\textbf{Warning: this paper may contain offensive content.}}
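The abstract describes a detect-rewrite-collect-finetune pipeline. Below is a minimal sketch of how such a self-reflective loop could be wired up, assuming only a generic text-in/text-out LLM callable; the function names, prompts, and iteration scheme are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a self-reflective detoxification loop.
# The `llm` callable, prompts, and helper names are assumptions for
# illustration only; they are not taken from the paper's code.

from typing import Callable, List, Tuple


def detect_toxic_signal(llm: Callable[[str], str], text: str) -> bool:
    """Ask the model itself whether a text is toxic (self-identification)."""
    verdict = llm(f"Is the following text toxic? Answer yes or no.\n\n{text}")
    return verdict.strip().lower().startswith("yes")


def self_rewrite(llm: Callable[[str], str], text: str) -> str:
    """Ask the model to rewrite the text non-toxically while keeping its meaning."""
    return llm(
        "Rewrite the following text so it is non-toxic but preserves the "
        f"original meaning:\n\n{text}"
    )


def build_contrastive_pairs(
    llm: Callable[[str], str], candidates: List[str], max_rounds: int = 3
) -> List[Tuple[str, str]]:
    """Collect (toxic, detoxified) pairs by iterating detection and rewriting."""
    pairs: List[Tuple[str, str]] = []
    for text in candidates:
        if not detect_toxic_signal(llm, text):
            continue  # keep only texts the model itself flags as toxic
        rewrite = text
        for _ in range(max_rounds):
            rewrite = self_rewrite(llm, rewrite)
            if not detect_toxic_signal(llm, rewrite):
                pairs.append((text, rewrite))
                break
    return pairs  # these pairs would then be used to fine-tune the same model


if __name__ == "__main__":
    # Toy stand-in for an LLM so the sketch runs without any model or network.
    def toy_llm(prompt: str) -> str:
        if prompt.startswith("Is the following text toxic?"):
            return "yes" if "awful" in prompt else "no"
        return prompt.rsplit("\n\n", 1)[-1].replace("awful", "disappointing")

    print(build_contrastive_pairs(toy_llm, ["You are awful at this.", "Nice work."]))
```

In this reading, the same model plays detector, rewriter, and fine-tuning target, which is what removes the need for external classifiers or annotated parallel data.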
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, Large Language Model, Self-Reflective
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 4207