Weight Decay Shapes Representation Geometry: Towards a More Nuanced Understanding of Sparse Autoencoders in Vision Transformers

Nina Burdorf; Christian Medeiros Adriano

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Venue: Mech Interp Workshop ICML 2026 Virtual poster. License: CC BY 4.0

Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Feature Geometry

TL;DR: Weight decay alters ViT representation geometry in ways that strongly affect sparse autoencoder feature recovery

Abstract: Mechanistic interpretability typically treats trained models as fixed objects, yet prior work shows that training fundamentally shapes representation geometry. We ask whether this geometry determines when sparse interpretability methods succeed or fail. Training 64 ViT-Tiny models across varied hyperparameters on traffic sign datasets, we find that weight decay is the dominant factor shaping Sparse Autoencoder (SAE) behavior. Across the sweep, higher monosemanticity and fewer dead SAE features correlate with better cross-entropy recovery in deep layers. A matched weight-decay sweep reveals a sharp threshold near wd = 0.01. Below it, SAE feature usage collapses into repeated reuse of the same small set of features; above it, diverse features emerge. This suggests that representation geometry, controlled by training choices such as weight decay, determines whether sparse methods recover meaningful structure. Training should therefore be treated as part of the interpretability pipeline.
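
The two usage regimes described in the abstract (collapse onto a small reused set versus diverse features) can be illustrated with a minimal sketch. It is not the authors' metric: the function name `sae_usage_stats`, the activation threshold `eps`, and the entropy-based "effective number of features" are all assumptions chosen for illustration, and the inputs are toy arrays rather than real SAE activations.

```python
import numpy as np

def sae_usage_stats(acts, eps=0.0):
    """Summarise feature usage for a matrix of SAE activations.

    acts: array of shape (n_samples, n_features), assumed non-negative.
    eps:  hypothetical threshold above which a feature counts as firing.
    Returns (dead_fraction, effective_features).
    """
    active = acts > eps                      # which features fire on which sample
    fire_counts = active.sum(axis=0)         # number of samples each feature fires on
    dead_fraction = float((fire_counts == 0).mean())

    total = fire_counts.sum()
    if total == 0:                           # nothing fires at all
        return dead_fraction, 0.0

    # Exponentiated entropy of the firing distribution: close to the number
    # of features that share the firing events evenly.
    p = fire_counts / total
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return dead_fraction, float(np.exp(entropy))

# Toy "collapsed" regime: only 2 of 8 features ever fire.
collapsed = np.zeros((4, 8))
collapsed[:, :2] = 1.0

# Toy "diverse" regime: each of 8 features fires on a different sample.
diverse = np.eye(8)

print(sae_usage_stats(collapsed))
print(sae_usage_stats(diverse))
```

In this toy setup the collapsed matrix has a high dead fraction and a small effective feature count, while the diverse matrix has no dead features and an effective count close to the dictionary size.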

Submission Number: 408
