GLOSE: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace

ICLR 2026 Conference Submission 15343 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: large language model; detoxification; toxic subspace
Abstract: Large language models (LLMs) exhibit exceptional cross-domain performance but pose inherent risks of generating toxic content, restricting their safe deployment. Traditional detoxification methods (e.g., fine-tuning, alignment) only adjust output preferences without eliminating the underlying toxic regions in the parameters, leaving models vulnerable to adversarial attacks that reactivate toxicity. Prior mechanistic studies model toxic regions in feed-forward networks as "toxic vectors" or "layer-wise subspaces", yet our analysis identifies critical limitations: (1) removed toxic vectors can be reconstructed via linear combinations of non-toxic vectors, demanding that the entire toxic subspace be targeted; (2) contrastive objectives over limited samples inject noise into layer-wise subspaces, hindering stable extraction. This highlights the core challenge of identifying a robust toxic subspace and removing it. We address this by first uncovering a key insight: LLMs contain a shared global toxic subspace across layers that is unaffected by layer-specific variations and enables stable toxic representation. Leveraging this, we propose **GLOSE** (**GL**obal t**O**xic **S**ubspace r**E**move), a lightweight method that mitigates toxicity by identifying and removing this global subspace from model parameters. Extensive experiments on LLMs (e.g., Qwen3) show that GLOSE achieves state-of-the-art detoxification while preserving general capabilities. Critically, it requires neither large-scale labeled datasets nor full retraining, ensuring high practicality for real-world use.
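To make the "remove the global subspace from model parameters" step concrete, below is a minimal, hedged sketch of projection-based subspace removal. It is not the authors' GLOSE implementation: the basis `toxic_basis`, the module-name filter `mlp.down_proj.weight`, and the assumption that the subspace is removed from FFN down-projection weights are all illustrative assumptions.

```python
# Illustrative sketch only: project FFN output weights onto the orthogonal
# complement of an assumed global toxic subspace. Not the paper's exact method.
import torch


@torch.no_grad()
def remove_toxic_subspace(model: torch.nn.Module, toxic_basis: torch.Tensor) -> None:
    """Strip the toxic subspace from FFN down-projection weights.

    toxic_basis: (d, k) orthonormal basis of the assumed shared toxic subspace,
    where d is the model's hidden size and k the subspace rank.
    """
    d, _ = toxic_basis.shape
    # Orthogonal projector onto the complement of the subspace: P = I - V V^T
    projector = torch.eye(d, dtype=toxic_basis.dtype) - toxic_basis @ toxic_basis.T

    for name, param in model.named_parameters():
        # Hypothetical name filter; actual FFN module names depend on the architecture.
        if "mlp.down_proj.weight" in name and param.shape[0] == d:
            # down_proj maps hidden activations into the d-dimensional residual stream,
            # so left-multiplying by P removes any component lying in the toxic subspace.
            p = projector.to(device=param.device, dtype=param.dtype)
            param.copy_(p @ param)
```

Under these assumptions, the edit is a one-shot weight surgery: no labeled corpus, no gradient updates, and the same (d, k) basis is applied to every layer, matching the paper's claim of a single global subspace shared across layers.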
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15343