Keywords: vision transformers, OOD, out-of-distribution generalization, diversification, distribution shifts
TL;DR: We propose an attention head diversification method for Vision Transformers that, together with head selection, improves OOD generalization.
Abstract: Deep learning models often learn and rely on only a small set of features, even when the training data contains a richer set of predictive signals. This makes models brittle and sensitive to distribution shifts. In this work, we show how to diversify the features learned by vision transformers (ViTs). We find that their attention heads inherently induce some modularity in their internal representations. We propose a new regularizer that acts on their input gradients and further enhances the diversity and complementarity of the learned features. We observe improved out-of-distribution (OOD) robustness on standard diagnostic benchmarks (MNIST-CIFAR and Waterbirds). We also show that much higher performance can be achieved by identifying and pruning the attention heads that extract spurious features.
Submission Number: 67
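Illustrative sketch: the abstract does not give the exact form of the input-gradient regularizer, so the snippet below shows only one plausible way to penalize overlap between heads. It assumes each attention head's input gradient has already been computed and flattened into one row of an array. The function name `head_diversity_penalty`, the array layout, and the choice of squared cosine similarity are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def head_diversity_penalty(head_grads, eps=1e-12):
    """Mean squared cosine similarity between input gradients of distinct heads.

    head_grads: array of shape (num_heads, input_dim), one flattened input
    gradient per attention head (hypothetical layout, assumed for this sketch).
    Returns 0.0 when all heads have mutually orthogonal gradients and 1.0 when
    all gradients are parallel.
    """
    g = np.asarray(head_grads, dtype=float)
    if g.ndim != 2 or g.shape[0] < 2:
        raise ValueError("expected an array of shape (num_heads >= 2, input_dim)")

    # Normalize each head's gradient so that only its direction matters.
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    unit = g / np.maximum(norms, eps)

    # Pairwise cosine similarities between heads.
    cos = unit @ unit.T

    # Average the squared similarity over pairs of distinct heads.
    off_diag = ~np.eye(g.shape[0], dtype=bool)
    return float(np.mean(cos[off_diag] ** 2))
```

In a training loop, a term like this would typically be scaled by a weighting coefficient and added to the task loss, with the gradients obtained from an automatic differentiation framework rather than passed in as a fixed array.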