The (Non-)Linear Representation of Toxicity in Qwen3 and Gemma-3

ACL ARR 2026 January Submission 10318 Authors

Date: 06 Jan 2026 (modified: 20 Mar 2026)

Venue: ACL ARR 2026 January Submission

Readers: Everyone

License: CC BY 4.0

Keywords: toxicity, explainability, interpretability, probing

Abstract: Toxic language is an important topic in safety research on Large Language Models (LLMs). The linear representation hypothesis postulates that high-level concepts are encoded as linear directions in the activation space of LLMs. While this holds for many linguistic features, the representational geometry of toxicity remains under-explored. In this paper, we report a feature study of the Qwen3 and Gemma-3 model families across several toxicity datasets. Combining activation patching with linear and non-linear probing experiments, we find that toxicity is not a monolithic linear feature in every transformer architecture. We demonstrate that non-linear probes significantly outperform linear ones in Qwen3, whereas Gemma-3 exhibits a more linear structure. Our results suggest that toxicity is represented as a manifold rather than a simple vector, and that this geometry varies significantly across model architectures. These findings have important implications for studies of domain-specific features and highlight a limitation of linear probes in domain-specific settings.
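
Illustrative sketch (not from the paper): the comparison between linear and non-linear probes described in the abstract can be pictured with a toy example. The code below is a minimal sketch under loud assumptions: the "activations" are synthetic 2-D points with an XOR-like label rather than hidden states from Qwen3 or Gemma-3, and the non-linear probe is approximated by logistic regression on a hypothetical quadratic feature map. All function names are invented for illustration, and the paper's actual probe architectures are not specified here.

```python
import math
import random


def make_xor_activations(n_per_cluster, rng, noise=0.1):
    """Toy 2-D 'activations': label is 1 when both coordinates share a sign."""
    data = []
    for sx in (-1.0, 1.0):
        for sy in (-1.0, 1.0):
            label = 1 if sx * sy > 0 else 0
            for _ in range(n_per_cluster):
                x = (sx + rng.gauss(0.0, noise), sy + rng.gauss(0.0, noise))
                data.append((x, label))
    rng.shuffle(data)
    return data


def linear_features(x):
    # Linear probe: reads the raw coordinates only.
    return [x[0], x[1]]


def quadratic_features(x):
    # Assumed stand-in for a non-linear probe: adds the interaction term.
    return [x[0], x[1], x[0] * x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(data, featurize, epochs=200, lr=0.5):
    """Logistic regression fitted with full-batch gradient descent."""
    dim = len(featurize(data[0][0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            f = featurize(x)
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def accuracy(data, featurize, w, b):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for x, y in data:
        z = sum(wi * fi for wi, fi in zip(w, featurize(x))) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(data)


rng = random.Random(0)
train = make_xor_activations(50, rng)
test = make_xor_activations(50, rng)

linear_acc = accuracy(test, linear_features, *train_probe(train, linear_features))
quadratic_acc = accuracy(test, quadratic_features, *train_probe(train, quadratic_features))
print(f"linear probe accuracy:    {linear_acc:.2f}")
print(f"quadratic probe accuracy: {quadratic_acc:.2f}")
```

On this toy data the label depends on the product of the two coordinates, so no single direction separates the classes. A gap between the two probes is the kind of evidence the abstract describes for Qwen3. A small gap would instead be consistent with a more linear structure.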

Paper Type: Long

Research Area: Special Theme (conference specific)

Research Area Keywords: probing, patching

Contribution Types: Model analysis & interpretability

Languages Studied: English

Submission Number: 10318