Weight Decay Shapes Representation Geometry: Towards a More Nuanced Understanding of Sparse Autoencoders in Vision Transformers
Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Feature Geometry
TL;DR: Weight decay alters ViT representation geometry in ways that strongly affect sparse autoencoder feature recovery
Abstract: Mechanistic interpretability typically treats trained models as fixed objects, yet prior work shows that training fundamentally shapes representation geometry. We ask whether this geometry determines when sparse interpretability methods succeed or fail. Training 64 ViT-Tiny models across varied hyperparameters on traffic sign datasets, we find that weight decay is the dominant factor shaping Sparse Autoencoder (SAE) behavior. Across the sweep, higher monosemanticity and fewer dead SAE features correlate with better cross-entropy recovery in deep layers. A matched weight-decay sweep reveals a sharp threshold near wd = 0.01. Below it, SAE feature usage collapses into repeated reuse of the same small set of features; above it, diverse features emerge. This suggests that representation geometry, controlled by training choices like weight decay, determines whether sparse methods recover meaningful structure. Training should therefore be treated as part of the interpretability pipeline.
Submission Number: 408