Keywords: Mechanistic Interpretability, Sparse Autoencoders, Variational Autoencoders, Machine Learning, Interpretability
TL;DR: We apply variational methods to sparse autoencoders and compare the result against current approaches.
Abstract: Sparse Autoencoders (SAEs) have emerged as a promising approach for interpreting neural network representations by learning sparse, human-interpretable features from dense activations. We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability. We introduce the variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with stochastic sampling from learned Gaussian posteriors and incorporates KL divergence regularization toward a standard normal prior. Our hypothesis is that this probabilistic sampling creates dispersive pressure, causing features to organize more coherently in the latent space while avoiding overlap. We evaluate a TopK vSAE against a standard TopK SAE on Pythia-70M transformer residual stream activations using comprehensive benchmarks, including SAE Bench, individual feature interpretability analysis, and global latent space visualization through t-SNE. The vSAE underperforms the standard SAE across core evaluation metrics, though it excels at feature independence and ablation metrics. The KL divergence term creates excessive regularization pressure that substantially reduces the fraction of living features, leading to the observed performance degradation. While vSAE features demonstrate improved robustness, they exhibit many more dead features than the baseline. Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.
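The mechanism described in the abstract (Gaussian posteriors with reparameterized sampling, a KL penalty toward a standard normal prior, and TopK selection) can be sketched as follows. This is a minimal illustrative implementation under assumed names and shapes (`TopKvSAE`, `d_model`, `d_sae`, `k`), not the authors' actual code.

```python
import torch
import torch.nn as nn


class TopKvSAE(nn.Module):
    """Minimal sketch of a TopK variational SAE: the encoder outputs a per-feature
    mean and log-variance, latents are sampled via the reparameterization trick,
    and only the top-k sampled latents are kept before decoding."""

    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.enc_mu = nn.Linear(d_model, d_sae)      # posterior mean
        self.enc_logvar = nn.Linear(d_model, d_sae)  # posterior log-variance
        self.dec = nn.Linear(d_sae, d_model)         # decoder back to activations
        self.k = k

    def forward(self, x: torch.Tensor):
        mu = self.enc_mu(x)
        logvar = self.enc_logvar(x)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Keep only the top-k latents per example, zeroing the rest
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        x_hat = self.dec(z_sparse)
        # KL divergence of N(mu, sigma^2) from the standard normal prior N(0, I)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        recon = (x_hat - x).pow(2).sum(dim=-1).mean()
        return x_hat, recon, kl
```

In this sketch the training loss would be `recon + beta * kl`; the abstract's finding that the KL term kills features corresponds to the regularization pressure this `beta`-weighted term places on every latent.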
Primary Area: interpretability and explainable AI
Submission Number: 22433