Keywords: ventral stream, circuit mechanisms, interpretability, deep learning, visual system, excitation-inhibition, neuroscience, closed-loop optimization, ablation
TL;DR: ImageNet-trained neural networks segregate the object/foreground features of their output layer onto the positive input weights, and visual cortex neurons show similar behavior.
Abstract: A core principle in both artificial and biological intelligence is the use of signed connections: positive and negative weights in artificial networks, and excitatory and inhibitory synapses in the brain. While both systems develop representations for diverse tasks, it is unclear whether positive and negative signals serve distinct representational roles or whether all representations require a balanced mixture of both. This is a fundamental question for mechanistic interpretability in neuroscience and AI.
Here, we investigate how signed weights shape visual representations in artificial and biological systems involved in object recognition. In ImageNet-trained neural networks, ablation and feature visualization reveal that removing positive inputs disrupts object features, while removing negative inputs preserves foreground representations but affects background textures. This segregation is more pronounced in adversarially robust models, persists with unsupervised learning, and vanishes with non-rectified activations.
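As a rough illustration of this kind of sign-based ablation, the sketch below zeroes the positive (or negative) incoming weights of one channel in a torchvision ResNet-50; the model, layer, and channel here are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of sign-based input ablation on an ImageNet-trained CNN.
# Assumptions (not the paper's exact settings): torchvision ResNet-50,
# the last bottleneck conv of layer4, channel 7.
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()

def ablate_sign(conv_weight: torch.Tensor, channel: int, sign: str) -> torch.Tensor:
    """Return a copy of conv_weight with the positive or negative incoming
    weights of `channel` zeroed out."""
    w = conv_weight.detach().clone()
    mask = w[channel] > 0 if sign == "positive" else w[channel] < 0
    w[channel][mask] = 0.0
    return w

# Remove the positive inputs to one output channel, then rerun feature
# visualization / activation measurements on the ablated model.
layer = model.layer4[-1].conv3
with torch.no_grad():
    layer.weight.copy_(ablate_sign(layer.weight, channel=7, sign="positive"))
```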
To better approximate the excitation versus inhibition segregation observed in biology (Dale’s law), we identified channels that projected predominantly positive or negative weights to the next layer. In early and intermediate layers, positive-projecting channels encode localized, object-like features, while negative-projecting channels encode more dispersed, background-like features.
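A minimal sketch of how predominantly positive- or negative-projecting channels could be identified, assuming an ImageNet ResNet-50 and arbitrary 0.8/0.2 cutoffs (both are illustrative assumptions):

```python
# Hedged sketch: classify the channels feeding a conv layer as predominantly
# positive- or negative-projecting, based on the sign of their outgoing
# weights (a Dale's-law-like criterion). Layer choice and cutoffs are assumed.
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()

def positive_outgoing_fraction(next_conv_weight: torch.Tensor) -> torch.Tensor:
    """For each input channel of a conv weight (out, in, kH, kW), return the
    fraction of its outgoing weight mass that is positive."""
    w = next_conv_weight.detach()
    pos = w.clamp(min=0).sum(dim=(0, 2, 3))
    neg = (-w).clamp(min=0).sum(dim=(0, 2, 3))
    return pos / (pos + neg + 1e-12)

frac = positive_outgoing_fraction(model.layer3[0].conv1.weight)
positive_projecting = (frac > 0.8).nonzero().flatten()  # excitatory-like channels
negative_projecting = (frac < 0.2).nonzero().flatten()  # inhibitory-like channels
```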
Motivated by these findings, we performed in vivo feature visualization on neurons across the monkey ventral visual stream (V1, V4, and IT). We also fitted linear models, using the layer that provides input to the classification units studied in the ANNs, to obtain model units with features similar to those preferred by the biological neurons.
We replicated the ablation experiments in these model neuron units and found, as with the classification units, that removing positive inputs altered representations more than removing negative ones.
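The following sketch shows one way such linear model neurons and their sign-based ablations could be set up; the arrays, shapes, and ridge penalty are placeholder assumptions for illustration, not the paper's data or settings.

```python
# Illustrative sketch (placeholder data, assumed shapes): fit a linear
# readout from an ANN feature layer to a recorded neuron's responses,
# giving a "model neuron" whose signed input weights can then be ablated.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
features = rng.standard_normal((500, 2048))  # ANN features for 500 images (placeholder)
responses = rng.standard_normal(500)         # recorded responses to the same images (placeholder)

readout = Ridge(alpha=10.0).fit(features, responses)
w = readout.coef_                            # signed input weights of the model neuron

# Sign-based ablation of the fitted readout.
w_without_positive = np.where(w > 0, 0.0, w)
w_without_negative = np.where(w < 0, 0.0, w)
```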
Notably, some units closely approached Dale's law: positive-projecting units exhibited localized features, while negative-projecting units showed larger, more dispersed features. Furthermore, we increased in vivo neuron responses by clearing the image background around the preferred feature, likely by reducing inhibitory inputs, providing concrete predictions for circuit neuroscientists to test.
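As a toy sketch of the background-clearing manipulation, assuming a hypothetical bounding box around the preferred feature:

```python
# Toy sketch of the background-clearing manipulation: keep the image inside a
# (hypothetical) bounding box around the preferred feature and replace the
# rest with the mean gray value.
import numpy as np

def clear_background(image: np.ndarray, box: tuple) -> np.ndarray:
    """image: HxWxC array; box: (y0, y1, x0, x1) around the preferred feature."""
    y0, y1, x0, x1 = box
    cleared = np.full_like(image, image.mean())
    cleared[y0:y1, x0:x1] = image[y0:y1, x0:x1]
    return cleared
```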
Our results demonstrate that both artificial and biological vision systems segregate features by weight sign: positive weights emphasize objects, negative weights encode context. This emergent organization offers a new perspective on interpretability and the convergence of representational strategies in brains and machines, with important predictions for visual neuroscience.
Primary Area: interpretability and explainable AI
Submission Number: 21502