Keywords: vision transformers, optimal transport, distribution shifts, layer normalization
TL;DR: Viewing transformers as maps acting on measures, we identify the layer normalizations as the transformer components most sensitive to distribution shifts and investigate how this sensitivity translates into finetuning gains.
Abstract: Transformers have become the default backbone of large foundation models, achieving state-of-the-art results in natural language processing, computer vision, and time series analysis. These general-purpose models are typically finetuned by practitioners on specific tasks and domains. While most methods focus on reducing the computational cost of adapting ever-larger models, a complementary stance is to better understand how the transformer architecture responds to distribution shifts -- an avenue that can improve efficiency and performance. In this work, we propose an approach to study the sensitivity of transformer components to distribution shifts. By viewing sequences of tokens as discrete measures, we show that transformer encoders can be decomposed into measure-to-measure maps and define the sensitivity to distribution shifts based on an averaged notion of Lipschitz continuity, commonly associated with robustness in the literature. Through a comprehensive empirical investigation on large vision transformers (ViT) across 30 corrupted versions of ImageNet, we demonstrate that the attention and feedforward normalization layers are consistently the most sensitive to input perturbations. While increased sensitivity does not consistently translate into better finetuning performance across all blocks, it remarkably does for the feedforward normalization layer, which is both highly sensitive and matches or surpasses full finetuning while reducing the number of trainable parameters by a factor of $5000$.
Overall, our findings provide new insights into how transformer components behave under distribution shifts, showcasing that a better understanding of the transformer architecture can inform the design of more efficient adaptation methods.
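For intuition, one plausible formalization of the sensitivity described in the abstract (the exact metric is not specified here and is our assumption) is an averaged Lipschitz ratio over clean/corrupted pairs: for a component $T$ acting on discrete token measures, $S(T) = \mathbb{E}_{(\mu,\tilde{\mu})}\big[ d(T(\mu), T(\tilde{\mu})) / d(\mu, \tilde{\mu}) \big]$, where $\mu$ and $\tilde{\mu}$ are the token measures of a clean image and its corrupted counterpart and $d$ is a distance between measures, e.g. a Wasserstein-type distance.

The abstract's finetuning result (training only the feedforward normalization layers, roughly $5000\times$ fewer trainable parameters than full finetuning) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes torchvision's ViT-B/16, whose `ln_2` module is taken to correspond to the "feedforward normalization layer", and the optimizer and learning rate are arbitrary choices.

```python
# Minimal sketch (assumptions noted above): finetune only the LayerNorm that
# precedes each MLP block ("ln_2" in torchvision's ViT naming), freezing the rest.
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Freeze everything, then unfreeze only the feedforward LayerNorm parameters.
# Depending on the downstream task, the classification head may also need training.
for name, param in model.named_parameters():
    param.requires_grad = ".ln_2." in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} parameters")  # ~18k vs ~86M

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ...standard training loop on the shifted / corrupted data goes here...
```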
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 19815