Tackling Polysemanticity with Neuron Embeddings

Published: 24 Jun 2024, Last Modified: 31 Jul 2024, ICML 2024 MI Workshop Poster, CC BY 4.0
Keywords: Machine Learning, Interpretability, Mechanistic Interpretability
Abstract: We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small and provide a UI for exploring the results. Neuron embeddings are computed from a model's internal representations and weights, making them domain- and architecture-agnostic and removing the risk of introducing external structure that may not reflect the model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs). As a proof of concept, we additionally incorporate a new loss term based on neuron embeddings into the SAE loss function and show that, when applied to a small toy MLP trained on MNIST, it trades off some representation accuracy and activation sparsity for more monosemantic neurons and significantly reduces the prevalence of dead neurons. We provide another UI for exploring these results.
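The abstract does not specify the form of the neuron-embedding loss term, so the sketch below only illustrates how an auxiliary term could be wired into a standard SAE objective (reconstruction error plus an L1 activation-sparsity penalty). All names here (`SparseAutoencoder`, `embedding_penalty`, the coefficients, and the penalty's entropy-over-similarities form) are assumptions for illustration, not the paper's definitions.

```python
# Minimal sketch, assuming a standard one-layer SAE and a hypothetical
# neuron-embedding penalty; the paper's actual term is not given in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Standard one-layer SAE: ReLU encoder, linear decoder."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        latents = F.relu(self.encoder(x))
        recon = self.decoder(latents)
        return recon, latents


def embedding_penalty(decoder_weight: torch.Tensor,
                      neuron_embeddings: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in: encourage each SAE feature direction to align with
    a single neuron embedding by penalising the entropy of its softmaxed cosine
    similarities. Not the paper's formulation."""
    # decoder_weight: [d_model, d_hidden]; neuron_embeddings: [n_neurons, d_model]
    feats = F.normalize(decoder_weight.t(), dim=-1)            # [d_hidden, d_model]
    embs = F.normalize(neuron_embeddings, dim=-1)              # [n_neurons, d_model]
    sims = feats @ embs.t()                                    # [d_hidden, n_neurons]
    probs = F.softmax(sims / 0.1, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # per-feature entropy
    return entropy.mean()


def sae_loss(x, recon, latents, sae, neuron_embeddings,
             l1_coeff: float = 1e-3, emb_coeff: float = 1e-3) -> torch.Tensor:
    recon_loss = F.mse_loss(recon, x)               # representation accuracy
    sparsity = latents.abs().sum(dim=-1).mean()     # activation sparsity (L1)
    aux = embedding_penalty(sae.decoder.weight, neuron_embeddings)
    return recon_loss + l1_coeff * sparsity + emb_coeff * aux
```

The extra term enters the objective exactly like the sparsity penalty, with its own coefficient, which is how the trade-off between representation accuracy, sparsity, and monosemanticity described in the abstract would be tuned.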
Submission Number: 86