Track: long paper (up to 5 pages)
Keywords: Hebbian Learning, Sparse Autoencoder, Dictionary Learning, Interpretability
TL;DR: Hebbian Winner-Take-All learning rules with anti-Hebbian updates can be framed as a form of tied-weight sparse autoencoder training.
Abstract: We establish a theoretical and empirical connection between Hebbian Winner-Take-All (WTA) learning with anti-Hebbian updates and tied-weight sparse autoencoders (SAEs), offering a framework that explains the high pattern selectivity of neurons induced by biologically inspired learning rules. By training an SAE on token embeddings of a small language model using a gradient-free Hebbian WTA rule with competitive anti-Hebbian plasticity, we demonstrate that such methods implicitly optimize SAE objectives, although they underperform SAEs trained with backpropagation on reconstruction because their updates only approximate the true gradients. Hebbian updates approximate minimization of the reconstruction error (MSE) under tied weights, while anti-Hebbian updates enforce sparsity and feature orthogonality, analogous to the explicit L1/L2 penalties in standard SAEs. This correspondence, consistent with the superposition hypothesis (Elhage et al., 2022), shows how Hebbian rules disentangle features in overcomplete latent spaces, and marks the first application of Hebbian learning to SAEs for language model interpretability.
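The following is a minimal sketch of the kind of update rule the abstract describes, not the authors' implementation: a gradient-free top-k WTA Hebbian step toward the reconstruction residual (approximating MSE minimization under tied weights) plus an anti-Hebbian decorrelation step among co-active winners. Dimensions, learning rates, and the top-k sparsity level are hypothetical.

```python
# Sketch only: Hebbian WTA + anti-Hebbian updates viewed as tied-weight SAE training.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden = 64, 256       # overcomplete latent space (hypothetical sizes)
k = 8                          # number of WTA winners per input (hypothetical)
lr_hebb, lr_anti = 1e-2, 1e-3  # hypothetical learning rates

W = rng.normal(size=(d_hidden, d_in))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm dictionary atoms

def step(x):
    """One Hebbian / anti-Hebbian update on a single input vector x."""
    a = W @ x                                   # tied-weight encoder: unit activations
    winners = np.argsort(a)[-k:]                # winner-take-all: keep the top-k units
    code = np.zeros(d_hidden)
    code[winners] = a[winners]                  # sparse latent code

    x_hat = W.T @ code                          # tied-weight decoder: reconstruction
    residual = x - x_hat

    # Hebbian step: move winning atoms toward the residual, which for tied
    # weights approximates a descent step on the MSE reconstruction loss.
    W[winners] += lr_hebb * np.outer(code[winners], residual)

    # Anti-Hebbian step: push co-active winners' atoms apart, an implicit
    # sparsity / orthogonality pressure akin to an explicit penalty term.
    for i in winners:
        for j in winners:
            if i != j:
                W[i] -= lr_anti * code[j] * W[j]

    W[winners] /= np.linalg.norm(W[winners], axis=1, keepdims=True)
    return np.mean(residual ** 2)

# Usage with random vectors standing in for token embeddings:
for _ in range(1000):
    step(rng.normal(size=d_in))
```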
Submission Number: 34