A Reproduction Study of Weight-Based Mechanistic Interpretability in Bilinear MLPs

TMLR Paper9470 Authors

03 Jun 2026 (modified: 09 Jun 2026) · Under review for TMLR · CC BY 4.0

Abstract: Mechanistic interpretability typically relies on post-hoc analysis of model activations, but bilinear MLPs offer an alternative: architectures whose weights are directly interpretable through eigendecomposition of interaction tensors. We reproduce both main experiments from Pearce et al. (2025): their Section 4 (Vision) on MNIST/Fashion-MNIST, and their Section 5 (Language), which discovers sentiment negation circuits via Sparse Autoencoder (SAE) analysis. Vision results reproduce cleanly: weight decay reduces effective rank from 38.5 to 15.5 while maintaining 97--98% accuracy, and our ablation shows that weight decay, not noise augmentation, is the primary driver of low-rank structure. In language, we confirm the AND-gate negation circuit (two semantically contrasting negation features, cosine similarity $-0.16$), but do not fully reproduce the low-rank interaction claim: the fraction of features achieving $>0.75$ rank-2 correlation varies from 32% (ts-medium) to 65% (fw-small); only fw-small meets this threshold. We provide a threshold sensitivity analysis and trace the gap to SAE training duration (correlation improves 2.6$\times$ over five checkpoints) and model compute (tokens/parameter). The fw-medium configuration required 8$\times$ rather than 4$\times$ expansion SAEs, making exact reproduction impossible; the language results therefore constitute a constrained replication under publicly available artifacts. In extensions, regularized bilinear MLPs transfer structurally across digit and letter datasets: MNIST-trained models classify geometrically similar EMNIST letters (O$\to$0, I$\to$1, Z$\to$2, S$\to$5) at 87--100% accuracy. We propose Quadratic Form Similarity (QFS), which separates similar from dissimilar digit-letter pairs (QFS $0.40$ vs. $-0.06$, $p<10^{-4}$) where cosine similarity fails ($0.358$ vs. $0.339$). Finally, we explore CP-decomposition as an architectural constraint, achieving 93.8% accuracy with effective rank 17.5 at $\sim$30$\times$ faster training, with CP factors that appear qualitatively more localized than dense eigenvectors, though the interpretability gains remain preliminary.
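
As an illustration of the weight-based analysis the abstract refers to, the following is a minimal sketch, not the authors' code. It assumes a bias-free bilinear layer of the form `P @ ((W @ x) * (V @ x))`; the names `W`, `V`, `P`, `interaction_matrix`, and the toy dimensions are hypothetical choices for this example. It shows that each output is a quadratic form in the input, so the symmetric interaction matrix for that output can be eigendecomposed directly from the weights.

```python
import numpy as np

def interaction_matrix(W, V, P, c):
    """Symmetric Q such that output c of the bilinear layer equals x^T Q x.

    Assumes out_c(x) = sum_h P[c, h] * (W[h] @ x) * (V[h] @ x), with no biases.
    """
    Q = np.einsum("h,hi,hj->ij", P[c], W, V)
    # Only the symmetric part of Q contributes to x^T Q x.
    return 0.5 * (Q + Q.T)

# Toy random weights (hypothetical sizes, not the paper's configuration).
rng = np.random.default_rng(0)
d_in, d_hidden, n_out = 8, 16, 3
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
P = rng.normal(size=(n_out, d_hidden))
x = rng.normal(size=d_in)

# Forward pass of the bilinear layer.
layer_out = P @ ((W @ x) * (V @ x))

# The same output for class 0, written as a quadratic form.
Q = interaction_matrix(W, V, P, c=0)
quad_out = x @ Q @ x

# Eigendecomposition of the symmetric interaction matrix,
# sorted by eigenvalue magnitude (largest first).
eigvals, eigvecs = np.linalg.eigh(Q)
order = np.argsort(-np.abs(eigvals))
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The output is a weighted sum of squared projections onto the eigenvectors.
spectral_out = np.sum(eigvals * (eigvecs.T @ x) ** 2)
```

In this formulation, the leading eigenvectors of `Q` are the input directions that contribute most to a given output, and a spectrum dominated by a few eigenvalues corresponds to the low-rank structure discussed above.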

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Matthew_Walter1

Submission Number: 9470
