Keywords: vision transformers, optimal transport, distribution shifts, layer normalization
TL;DR: Viewing transformers as maps acting on measures, we identify the layer normalizations as the transformer components most sensitive to distribution shifts and investigate how this sensitivity translates into finetuning gains.
Abstract: Transformers have become the default backbone of large foundation models, achieving state-of-the-art results in natural language processing, computer vision, and time series analysis. These general-purpose models are typically finetuned by practitioners on specific tasks and domains. While most methods focus on reducing the computational cost of adapting ever-larger models, a complementary stance is to better understand how the transformer architecture responds to distribution shifts -- an avenue that can improve efficiency and performance. In this work, we propose an approach to study the sensitivity of transformer components to distribution shifts. By viewing sequences of tokens as discrete measures, we show that transformer encoders can be decomposed into measure-to-measure maps and define the sensitivity to distribution shifts based on an averaged notion of Lipschitz continuity, commonly associated with robustness in the literature. Through a comprehensive empirical investigation on large vision transformers (ViT) across 30 corrupted versions of ImageNet, we demonstrate that the attention and feedforward normalization layers are consistently the most sensitive to input perturbations. While increased sensitivity does not consistently translate into better finetuning performance across all blocks, it remarkably does for the feedforward normalization layer, which is both highly sensitive and matches or surpasses full finetuning while reducing the number of trainable parameters by a factor of $5000$.
Overall, our findings provide new insights into how transformer components behave under distribution shifts, showcasing that a better understanding of the transformer architecture can inform the design of more efficient adaptation methods.
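For intuition, one plausible formalization of the sensitivity described in the abstract (the exact metric is not specified here and is our assumption) is an averaged Lipschitz ratio over clean/corrupted pairs: for a component $T$ acting on discrete token measures, $S(T) = \mathbb{E}_{(\mu,\tilde{\mu})}\big[ d(T(\mu), T(\tilde{\mu})) / d(\mu, \tilde{\mu}) \big]$, where $\mu$ and $\tilde{\mu}$ are the token measures of a clean image and its corrupted counterpart and $d$ is a distance between measures, e.g. a Wasserstein-type distance.

The abstract's finetuning result (training only the feedforward normalization layers, roughly $5000\times$ fewer trainable parameters than full finetuning) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes torchvision's ViT-B/16, whose `ln_2` module is taken to correspond to the "feedforward normalization layer", and the optimizer and learning rate are arbitrary choices.

```python
# Minimal sketch (assumptions noted above): finetune only the LayerNorm that
# precedes each MLP block ("ln_2" in torchvision's ViT naming), freezing the rest.
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Freeze everything, then unfreeze only the feedforward LayerNorm parameters.
# Depending on the downstream task, the classification head may also need training.
for name, param in model.named_parameters():
    param.requires_grad = ".ln_2." in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} parameters")  # ~18k vs ~86M

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ...standard training loop on the shifted / corrupted data goes here...
```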
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 19815