Abstract: Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). To design principled shortcut-mitigation methods, it is crucial to understand how shortcuts affect feature representations. In this work, we investigate the layer-wise localization of shortcuts in deep models. We propose a novel experiment design that quantifies the layer-wise contribution to accuracy degradation caused by a shortcut. Our method introduces a shortcut-inducing data skew into the training process and counterfactually compares training on clean and skewed datasets using suitable shortcut-learning metrics. We employ our method to study vision classification shortcuts across the CIFAR-10, Waterbirds, and CelebA datasets and the VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: earlier layers predominantly encode spurious features, while later layers predominantly forget core features (i.e., features that are predictive on clean data). We analyze the differences in localization and describe their principal axes of variation. Finally, we investigate layer-wise training interventions and find that our localization metrics are predictive of their success.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lei_Feng1
Submission Number: 9514