Keywords: Interpretability, Activation function, Input differentiation, Entanglement
TL;DR: We show, for the first time, that the negative activation space in non-ReLU models plays a mechanistic role in computation, with entangled Wasserstein neurons leveraging it to perform fine-grained input differentiation.
Abstract: Understanding neuronal mechanisms in large language models remains challenging, particularly due to polysemanticity and superposition. In this work, we further investigate the previously identified "Wasserstein neurons," characterized by non-Gaussian pre-activation distributions. Our analysis reveals that these neurons are more prevalent and exhibit faster learning dynamics in larger models. Critically, we demonstrate for the first time the mechanistic significance of the negative activation space, showing that Wasserstein neurons leverage negative pre-activations for nuanced input differentiation, especially regarding syntactic and structural tokens. Ablation experiments confirm that constraining negative activations significantly degrades model performance, highlighting previously underappreciated computational roles. These findings offer new directions for interpretability research by emphasizing the importance of negative computation.
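The abstract characterizes Wasserstein neurons by how far their pre-activation distribution departs from a Gaussian. As a minimal sketch of that idea (not the paper's actual pipeline), one could score a neuron by the 1-Wasserstein distance between its pre-activations and a moment-matched Gaussian reference; the helper name `gaussianity_gap` is hypothetical, and SciPy's `wasserstein_distance` is used for the metric:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def gaussianity_gap(preacts, seed=0):
    """1-Wasserstein distance between a neuron's pre-activations
    and a Gaussian sample with the same mean and std."""
    rng = np.random.default_rng(seed)
    ref = rng.normal(preacts.mean(), preacts.std(), size=len(preacts))
    return wasserstein_distance(preacts, ref)

# Illustrative comparison: a Gaussian-like neuron vs. a bimodal one.
rng = np.random.default_rng(42)
gauss_neuron = rng.normal(0.0, 1.0, 50_000)
bimodal_neuron = np.concatenate([rng.normal(-3.0, 0.5, 25_000),
                                 rng.normal(3.0, 0.5, 25_000)])
print(gaussianity_gap(gauss_neuron), gaussianity_gap(bimodal_neuron))
```

Under this score, a bimodal pre-activation distribution (as Wasserstein neurons exhibit) yields a markedly larger gap than a near-Gaussian one.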
Student Paper: Yes
Submission Number: 80