Keywords: Interpretability, Activation function, Input differentiation, Entanglement
TL;DR: We show, for the first time, that the negative activation space in non-ReLU models plays a mechanistic role in computation, with entangled Wasserstein neurons leveraging it to perform fine-grained input differentiation.
Abstract: Understanding neuronal mechanisms in large language models remains challenging, particularly due to polysemanticity and superposition. In this work, we further investigate the previously identified "Wasserstein neurons," characterized by non-Gaussian pre-activation distributions. Our analysis reveals that these neurons are more prevalent and exhibit faster learning dynamics in larger models. Critically, we demonstrate for the first time the mechanistic significance of the negative activation space, showing that Wasserstein neurons leverage negative pre-activations for nuanced input differentiation, especially regarding syntactic and structural tokens. Ablation experiments confirm that constraining negative activations significantly degrades model performance, highlighting previously underappreciated computational roles. These findings offer new directions for interpretability research by emphasizing the importance of negative computation.
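The abstract characterizes Wasserstein neurons by how far their pre-activation distribution departs from a Gaussian. As a minimal sketch of that idea (not the paper's actual pipeline), one could score a neuron by the 1-Wasserstein distance between its pre-activations and a moment-matched Gaussian reference; the helper name `gaussianity_gap` is hypothetical, and SciPy's `wasserstein_distance` is used for the metric:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def gaussianity_gap(preacts, seed=0):
    """1-Wasserstein distance between a neuron's pre-activations
    and a Gaussian sample with the same mean and std."""
    rng = np.random.default_rng(seed)
    ref = rng.normal(preacts.mean(), preacts.std(), size=len(preacts))
    return wasserstein_distance(preacts, ref)

# Illustrative comparison: a Gaussian-like neuron vs. a bimodal one.
rng = np.random.default_rng(42)
gauss_neuron = rng.normal(0.0, 1.0, 50_000)
bimodal_neuron = np.concatenate([rng.normal(-3.0, 0.5, 25_000),
                                 rng.normal(3.0, 0.5, 25_000)])
print(gaussianity_gap(gauss_neuron), gaussianity_gap(bimodal_neuron))
```

Under this score, a bimodal pre-activation distribution (as Wasserstein neurons exhibit) yields a markedly larger gap than a near-Gaussian one.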
Student Paper: Yes
Submission Number: 80