Cross-Layer Discrete Concept Discovery for Interpreting Language Models


TMLR Paper8834 Authors

08 May 2026 (modified: 22 May 2026). Under review for TMLR. CC BY 4.0

Abstract: Interpreting language models remains challenging due to the residual stream, which linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce CLVQ-VAE, a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer VQ-VAE, and SAE baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.
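The abstract pairs top-k temperature-based sampling with EMA codebook updates. The sketch below shows one way such a quantization step could look; it is not taken from the paper. It assumes squared Euclidean distances, a softmax over the negative distances of the k nearest codes, and the standard VQ-VAE EMA rule with Laplace smoothing. The function names (`sample_code_topk`, `ema_update`) and the defaults for `k`, `temperature`, `decay`, and `eps` are hypothetical.

```python
import numpy as np

def sample_code_topk(x, codebook, k=5, temperature=1.0, rng=None):
    """Sample a codebook index for x from its k nearest codes (assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    # Squared Euclidean distance from x to every codebook vector.
    dists = np.sum((codebook - x) ** 2, axis=1)
    # Indices of the k nearest codes (order within the k is arbitrary).
    topk = np.argpartition(dists, k - 1)[:k]
    # Softmax over negative distances; a lower temperature approaches argmin.
    logits = -dists[topk] / temperature
    logits -= logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(topk, p=probs))

def ema_update(codebook, cluster_size, embed_sum, xs, idxs, decay=0.99, eps=1e-5):
    """One in-place EMA codebook step (standard VQ-VAE EMA rule, assumed here)."""
    n_codes = codebook.shape[0]
    onehot = np.eye(n_codes)[idxs]  # (batch, n_codes) assignment matrix
    # Running count of vectors assigned to each code.
    cluster_size *= decay
    cluster_size += (1 - decay) * onehot.sum(axis=0)
    # Running sum of the vectors assigned to each code.
    embed_sum *= decay
    embed_sum += (1 - decay) * onehot.T @ xs
    # Laplace smoothing keeps rarely used codes from dividing by ~zero.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + n_codes * eps) * n
    codebook[:] = embed_sum / smoothed[:, None]
```

In this sketch, raising `temperature` or `k` spreads assignments over more codes (more exploration), while the EMA step moves each code toward the mean of the vectors assigned to it without a gradient update.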

Submission Type: Regular submission (no more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=xBVTqiHY6l

Changes Since Last Submission: This revision directly addresses both concerns raised in the prior decision and adds further controls to make the comparison stronger.

1. Stronger SAE baseline (addresses Concern 1)

The previous submission ablated SAEs by removing a single decoder vector corresponding to the top-activated neuron, which is unfair to SAEs given that they tend to split features across many neurons. Following the AE's suggestion, the SAE ablation is now performed via orthogonal projection onto the subspace spanned by the top-$k$ encoder weight vectors, with the basis orthonormalized via Gram-Schmidt (Section 4.1.1; Appendix B.3, Eq. 13). We use encoder weights rather than decoder vectors because they live in the input-layer space of the perturbed token, ensuring geometrically consistent ablation. A full sweep over $k \in \{1, 5, 10\}$ is reported in Appendix A.3.2 (Table 10); $k=10$ is used in the main results. SAE training hyperparameters are also revisited (Appendix A.3.2): expansion ratios and $L_1$ coefficients are now reported separately for encoder and decoder models, with input-centering bias and encoder/decoder bias initialization documented. Under this fairer baseline, CLVQ-VAE still outperforms SAE in 10 of 12 configurations, including all 6 decoder-model settings by an average of 15.8 percentage points. As a consequence of the stronger SAE, the Table 1 narrative (Section 4.1.2) is updated: CLVQ-VAE wins 5 of 12 settings outright, and a VQ-VAE method (CLVQ-VAE or Single-Layer) ranks first or second in all 12 configurations.

2. Codebook concept specificity (addresses Concern 2)

We add a new Section 4.2.1 (Codebook Concept Specificity) with three complementary analyses showing that codebook vectors encode specific concepts rather than compressing entire sentence representations:

- Vector-level (label purity): purity exceeds the random baseline across all 12 configurations (0.60–0.76 vs. 0.50 for binary tasks; 0.35–0.47 vs. 0.25 for AGNews), with a notable fraction of vectors firing exclusively on a single class (Figure 2).
- Token-level (label-conditioned routing): 28–76% of content tokens route to different codebook vectors depending on sentence label, with mean Jensen–Shannon divergence of 0.56–0.92. Table 16 contains the discriminative test the AE explicitly requested: e.g., "entertainment" routes to vector #137 (91% negative purity) in a negative review but to vector #73 (97% positive purity) in a positive review, with further examples for Jigsaw and AGNews.
- Sentence-level (TF-IDF-conditioned overlap): at controlled lexical similarity, same-label sentence pairs share 2.10–7.56× more codebook vectors than different-label pairs, widening to 30.56× at stricter thresholds (e.g., Jigsaw/RoBERTa at $\tau=0.3$). Conditioning on TF-IDF cosine similarity isolates semantic from surface effects.

Pairwise cosine similarity between codebook vectors is also reported (Appendix E.1, Table 18). Full per-configuration results across all four models and three datasets are in Table 17.

3. Additional strengthening

Beyond the two AE concerns:

- Llama Scope sparse transcoder (Appendix C.3, Table 12): to verify that the weak SAE faithfulness on decoder models is not specific to the Dunefsky implementation, we additionally evaluate against a Llama Scope-aligned sparse transcoder (TopK activation, $k=32$). It produces similar near-zero or negative drops on decoder models, confirming this reflects a general limitation of SAE-based methods rather than an implementation artifact, consistent with the feature-splitting hypothesis.
- Random Active Codebook control (Appendix C.4, Table 13): we add a control that ablates a randomly sampled active codebook vector (one assigned to other tokens but not the salient one) rather than the predicted concept. The identified salient concept causes substantially greater performance degradation than other active codebook vectors in 8 of 9 configurations, confirming the drops in Table 1 are specific to the concept-aligned direction rather than an artifact of removing any learned direction.

4. Improved the overall writing and clarity of the text.

The anonymized code repository remains at https://anonymous.4open.science/r/CLVQVAE-9386
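The subspace ablation described for the stronger SAE baseline can be illustrated with a short sketch. The revision notes cite Eq. 13 but do not reproduce it, so this is an assumed form rather than the authors' implementation. It assumes that ablation means subtracting from the hidden state $h$ its projection onto the span of the top-$k$ encoder weight rows, i.e. $h' = h - Q^\top Q h$ with $Q$ the Gram-Schmidt-orthonormalized basis. The names `gram_schmidt` and `ablate_subspace`, and the choice to rank latents by activation, are hypothetical.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize the rows of `vectors`, dropping near-dependent rows."""
    vectors = np.asarray(vectors, dtype=float)
    basis = []
    for v in vectors:
        w = v.copy()
        for q in basis:
            w -= np.dot(q, w) * q  # remove the component along q
        norm = np.linalg.norm(w)
        if norm > tol:
            basis.append(w / norm)
    return np.array(basis).reshape(-1, vectors.shape[1])

def ablate_subspace(h, enc_weights, activations, k=10):
    """Remove from h its component in the span of the top-k encoder rows.

    Assumption: the top-k latents are those with the largest activations.
    """
    topk = np.argsort(activations)[-k:]   # indices of the k most active latents
    Q = gram_schmidt(enc_weights[topk])   # (<= k, d) orthonormal rows
    return h - Q.T @ (Q @ h)              # h minus its projection onto span(Q)
```

With $k=1$ this reduces to removing a single direction; larger $k$ removes a feature that has been split across several latents, which is the motivation the notes give for the change.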

Assigned Action Editor: ~Mengnan_Du1

Submission Number: 8834
