The (Non-)Linear Representation of Toxicity in Qwen3 and Gemma-3

ACL ARR 2026 January Submission10318 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: toxicity, explainability, interpretability, probing
Abstract: Toxic language is an important area of safety research on Large Language Models (LLMs). The linear representation hypothesis postulates that high-level concepts are encoded as linear directions in the activation space of LLMs. While this holds for many linguistic features, the representational geometry of toxicity remains under-explored. In this paper, we report a feature study of the Qwen3 and Gemma-3 model families across several toxicity datasets. Combining activation patching with linear and non-linear probing experiments, we find that toxicity is not encoded as a single linear feature in every transformer architecture. We demonstrate that non-linear probes significantly outperform linear ones in Qwen3, while Gemma-3 exhibits a more linear structure. Our results suggest that toxicity is represented as a manifold rather than a simple vector, and that this geometry varies significantly across model architectures. These findings have important implications for studies of domain-specific features: they highlight a limitation of linear probes when the target feature is not linearly encoded.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: probing, patching
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10318
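
The linear versus non-linear probe comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the "activations" are synthetic 2-D points with an XOR-like label rather than hidden states from Qwen3 or Gemma-3, the non-linear probe is a logistic regression on quadratic features (the submission page does not specify the probe architecture), and the helper names (`make_activations`, `fit_probe`, `compare_probes`) are hypothetical.

```python
import math
import random


def make_activations(n, seed):
    """Synthetic 2-D stand-ins for activations (an assumption, not model data).

    The label is 1 when both coordinates share a sign, so no single linear
    direction separates the two classes.
    """
    rng = random.Random(seed)
    xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    ys = [1 if a * b > 0 else 0 for a, b in xs]
    return xs, ys


def linear_features(x):
    # Linear probe: the raw coordinates only.
    return [x[0], x[1]]


def quadratic_features(x):
    # Non-linear probe: adds squares and the cross term.
    return [x[0], x[1], x[0] * x[0], x[0] * x[1], x[1] * x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_probe(xs, ys, featurize, epochs=500, lr=0.5):
    """Logistic regression fitted by full-batch gradient descent."""
    feats = [featurize(x) for x in xs]
    w = [0.0] * len(feats[0])
    b = 0.0
    n = len(feats)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for f, y in zip(feats, ys):
            err = sigmoid(b + sum(wi * fi for wi, fi in zip(w, f))) - y
            grad_b += err
            for j, fj in enumerate(f):
                grad_w[j] += err * fj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(params, xs, ys, featurize):
    w, b = params
    hits = 0
    for x, y in zip(xs, ys):
        z = b + sum(wi * fi for wi, fi in zip(w, featurize(x)))
        hits += int((z > 0) == (y == 1))
    return hits / len(xs)


def compare_probes():
    """Fit both probes on one split and return their held-out accuracies."""
    train_x, train_y = make_activations(300, seed=0)
    test_x, test_y = make_activations(300, seed=1)
    lin = fit_probe(train_x, train_y, linear_features)
    quad = fit_probe(train_x, train_y, quadratic_features)
    return (
        accuracy(lin, test_x, test_y, linear_features),
        accuracy(quad, test_x, test_y, quadratic_features),
    )


if __name__ == "__main__":
    lin_acc, quad_acc = compare_probes()
    print(f"linear probe accuracy:    {lin_acc:.3f}")
    print(f"quadratic probe accuracy: {quad_acc:.3f}")
```

On data of this kind, a gap between the two held-out accuracies is the signal that the labeled concept is not captured by a single linear direction. In a real study the synthetic points would be replaced by per-layer hidden states, and the probe family, regularization, and control tasks would need to be chosen with more care than this sketch takes.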