Abstract: This paper presents a reproducibility study of "Bilinear MLPs enable weight-based mechanistic interpretability" by Pearce et al. (2024), which proposes that bilinear architectures possess intrinsic interpretability properties accessible via eigenvalue decomposition. We verify the paper's core empirical claims for image classification. Our results confirm them: bilinear layers consistently exhibit an interpretable low-rank structure in which the leading eigenvectors capture the majority of task-relevant information, allowing significant truncation without performance loss. We further validate that these eigenstructures are stable across random initializations and model sizes. We also explore extensions to the original work, demonstrating that adversarial training (specifically PGD) enhances the interpretability of eigenvector features on MNIST. Finally, we examine generalization to more complex RGB datasets, CIFAR-10 and CIFAR-100, where the resulting eigenvectors lack interpretable structure. All our code is publicly available at: https://anonymous.4open.science/r/reproduced-mech-inter-image-class
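As a rough illustration of the eigenvalue decomposition and truncation mentioned in the abstract, the following minimal sketch assumes a bias-free bilinear layer whose logit for one class is p · ((W x) ⊙ (V x)). The names W, V, p, interaction_matrix, and truncated_logit are hypothetical and chosen for this example; they are not taken from the authors' repository, and the weights here are random rather than trained.

```python
import numpy as np

def interaction_matrix(W, V, p):
    """Symmetric matrix Q such that x @ Q @ x equals the class logit.

    W, V: (hidden, d_in) input weights of the two branches (assumed names).
    p:    (hidden,) output weights for a single class (assumed name).
    """
    Q = np.einsum("k,ki,kj->ij", p, W, V)
    # Only the symmetric part contributes to the quadratic form x^T Q x.
    return 0.5 * (Q + Q.T)

def truncated_logit(x, Q, rank):
    """Approximate x^T Q x using the `rank` eigenvectors of largest |eigenvalue|."""
    evals, evecs = np.linalg.eigh(Q)
    keep = np.argsort(-np.abs(evals))[:rank]
    proj = evecs[:, keep].T @ x
    return float(np.sum(evals[keep] * proj ** 2))

rng = np.random.default_rng(0)
d_in, hidden = 8, 16
W = rng.normal(size=(hidden, d_in))
V = rng.normal(size=(hidden, d_in))
p = rng.normal(size=hidden)
x = rng.normal(size=d_in)

Q = interaction_matrix(W, V, p)
full_logit = float(p @ ((W @ x) * (V @ x)))
```

Keeping all d_in eigenvectors reproduces full_logit up to floating-point error; passing a smaller rank to truncated_logit gives the kind of low-rank approximation whose effect on accuracy the study measures on trained models.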
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=pRKGX2nX89&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: We have updated the template to the latest version to include the missing header. Please note that this is our third submission, and the only change made was to the template itself; our second submission contains the major content revisions compared to the first.
Assigned Action Editor: ~Ehsan_Amid1
Submission Number: 9465