Abstract: Interpreting language models remains challenging because of the residual stream, which linearly mixes and duplicates information across adjacent layers. This leads to under-detection of features when analysis is restricted to a single layer. Existing work either analyzes neural representations at individual layers, overlooking this cross-layer superposition, or applies a cross-layer variant of the sparse autoencoder (SAE). SAEs, however, operate in a continuous space, so there are no clear boundaries between neurons representing different concepts. We address these limitations by introducing the Cross-Layer Vector Quantized Variational Autoencoder (CLVQ-VAE), a novel framework that maps representations across layers through vector quantization. Quantization collapses duplicated features in the residual stream, yielding compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling during quantization with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Quantitative and qualitative experiments on the ERASER-Movie, Jigsaw, and AGNews datasets show that CLVQ-VAE, combined with appropriate initialization, effectively discovers meaningful concepts that explain model predictions.
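The two mechanisms named in the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch implementation (not the authors' code) of top-k temperature-based sampling over codebook distances during quantization, combined with EMA codebook updates; all names, shapes, and hyperparameter values (k, temperature, decay) are illustrative assumptions.

```python
# Sketch of a quantizer with top-k temperature sampling and EMA codebook
# updates, as described in the abstract. Everything here (class name, default
# sizes, hyperparameters) is an assumption for illustration.
import torch
import torch.nn.functional as F

class TopKEMAQuantizer(torch.nn.Module):
    def __init__(self, num_codes=512, dim=768, k=5, temperature=1.0,
                 decay=0.99, eps=1e-5):
        super().__init__()
        self.k, self.temperature, self.decay, self.eps = k, temperature, decay, eps
        codebook = torch.randn(num_codes, dim)
        self.register_buffer("codebook", codebook)
        self.register_buffer("ema_counts", torch.ones(num_codes))
        self.register_buffer("ema_sums", codebook.clone())

    def forward(self, z):                      # z: (batch, dim) residual-stream vectors
        dists = torch.cdist(z, self.codebook)  # (batch, num_codes) Euclidean distances
        # Top-k temperature-based sampling: restrict to the k nearest codes,
        # then sample from a softmax over their negative distances.
        topk_d, topk_idx = dists.topk(self.k, dim=1, largest=False)
        probs = F.softmax(-topk_d / self.temperature, dim=1)
        choice = torch.multinomial(probs, 1).squeeze(1)
        codes = topk_idx.gather(1, choice.unsqueeze(1)).squeeze(1)
        z_q = self.codebook[codes]
        if self.training:
            # EMA codebook update: codes move toward the running mean of the
            # encoder outputs assigned to them (no gradient on the codebook).
            one_hot = F.one_hot(codes, self.codebook.size(0)).type_as(z)
            self.ema_counts.mul_(self.decay).add_(one_hot.sum(0), alpha=1 - self.decay)
            self.ema_sums.mul_(self.decay).add_(one_hot.t() @ z, alpha=1 - self.decay)
            self.codebook.copy_(self.ema_sums / (self.ema_counts.unsqueeze(1) + self.eps))
        # Straight-through estimator so gradients flow back to the encoder.
        return z + (z_q - z).detach(), codes
```

A higher temperature spreads probability mass across the k candidate codes (more exploration of the discrete latent space), while temperature near zero recovers deterministic nearest-neighbor quantization; the EMA update keeps rarely selected codes from drifting abruptly, which is one way to maintain codebook diversity.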
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Akshay_Rangamani1
Submission Number: 6515