Keywords: large language model; detoxification; toxic subspace
Abstract: Large language models (LLMs) exhibit exceptional cross-domain performance but pose inherent risks of generating toxic content, restricting their safe deployment. Traditional detoxification methods (e.g., fine-tuning, alignment) only adjust output preferences without eliminating the underlying toxic regions in model parameters, leaving models vulnerable to adversarial attacks that reactivate toxicity.
Prior mechanistic studies model toxic regions in feed-forward networks as "toxic vectors" or "layer-wise subspaces", yet our analysis identifies critical limitations:
(1) Removed toxic vectors can be reconstructed via linear combinations of non-toxic vectors, so the entire toxic subspace must be targeted; (2) contrastive objectives over limited samples inject noise into layer-wise subspaces, hindering stable extraction.
This highlights the core challenge of identifying a robust toxic subspace and removing it.
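The reconstruction issue in limitation (1) can be illustrated with a minimal numpy sketch on hypothetical toy data (the dimensions, vector bank, and toxic direction below are illustrative assumptions, not the paper's setup): deleting the single most toxic value vector does not remove the toxic direction when the remaining vectors still span it.

```python
# Toy illustration: deleting one "toxic vector" does not eliminate the toxic
# direction, because the surviving vectors can linearly reconstruct it.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # hidden dimension (toy scale)
toxic_dir = rng.normal(size=d)
toxic_dir /= np.linalg.norm(toxic_dir)    # unit "toxic" direction

# A bank of value vectors, each carrying a small toxic component plus noise.
V = rng.normal(size=(32, d)) + 0.5 * toxic_dir

toxic_idx = np.argmax(V @ toxic_dir)      # the single most toxic vector
V_pruned = np.delete(V, toxic_idx, axis=0)

# Least-squares reconstruction of the deleted vector from the survivors.
coeffs, *_ = np.linalg.lstsq(V_pruned.T, V[toxic_idx], rcond=None)
recon = V_pruned.T @ coeffs

cos = recon @ V[toxic_idx] / (np.linalg.norm(recon) * np.linalg.norm(V[toxic_idx]))
print("cosine(reconstruction, deleted vector):", cos)   # close to 1.0
print("toxic component of reconstruction:", recon @ toxic_dir)
```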
We address this by first uncovering a key insight: LLMs contain a global toxic subspace shared across layers, which is unaffected by layer-specific variations and enables a stable representation of toxicity.
Leveraging this, we propose **GloSS** (**Gl**obal t**O**xic **S**ubspace **S**uppression) -- a lightweight method that mitigates toxicity by identifying this global subspace and removing it from the model parameters.
Extensive experiments on LLMs (e.g., Qwen3) show that GloSS achieves state-of-the-art detoxification while preserving general capabilities. Critically, it requires neither large-scale labeled data nor full retraining, making it highly practical for real-world deployment.
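The core operation described above, removing a shared low-rank subspace from feed-forward weights, can be sketched as an orthogonal projection. The snippet below is a minimal sketch under stated assumptions, not the authors' released implementation: `toxic_basis` is a hypothetical orthonormal basis of the global toxic subspace (as it would be estimated from toxic vs. non-toxic activations), and the `mlp.down_proj` module name pattern is an assumption matching LLaMA/Qwen-style FFNs.

```python
# Minimal sketch: project a shared "toxic" subspace out of every FFN
# down-projection weight matrix via P = I - B B^T.
import torch

@torch.no_grad()
def remove_global_subspace(model: torch.nn.Module, toxic_basis: torch.Tensor):
    """toxic_basis: (d_model, k) orthonormal basis spanning the toxic subspace."""
    d_model, _ = toxic_basis.shape
    # P removes any output component lying in span(toxic_basis).
    proj = torch.eye(d_model) - toxic_basis @ toxic_basis.T
    for name, module in model.named_modules():
        # Module-name pattern is an assumption for LLaMA/Qwen-style architectures.
        if name.endswith("mlp.down_proj") and isinstance(module, torch.nn.Linear):
            # weight has shape (d_model, d_ffn); left-multiplying by P filters
            # the toxic directions out of every output column.
            module.weight.copy_(proj.to(module.weight) @ module.weight)
    return model
```

Because the projection is applied once to the weights, inference cost and model architecture are unchanged; only the components of FFN outputs that fall inside the identified subspace are suppressed.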
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15343