Dropout and the Outliers: Could Transformers Overcome Their Single Points of Failure?

Published: 02 Mar 2026, Last Modified: 26 Apr 2026, Sci4DL 2026, CC BY 4.0
Keywords: Dropout, outliers, Transformers, Empirical Study
Abstract: Due to their complex structure, Transformer architectures give rise to curious empirical phenomena. One such phenomenon has recently attracted significant attention: the formation of disproportionately large attention, weight, and activation values during training, often called outliers. Hardware efficiency, security, and other considerations make these extreme values undesirable in robust architectures. While many recent works have observed outliers in transformer-based models, few have considered the effect of Dropout, an algorithm originally designed to increase the robustness of neural networks. In this work, we provide a systematic assessment of the effect of Dropout on the formation of outliers across different modality, architecture, and optimization choices. We show that, in our setups, Dropout reduces outliers on average but does not suppress them completely. Our findings offer a paradoxical view that contrasts with the folklore belief that Dropout tends to equalize values across the network, and they raise important questions about the implicit bias of Dropout in transformer-based models under certain optimization and architectural choices.
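To make the two ingredients of the abstract concrete, the following is a minimal NumPy sketch (not the paper's actual pipeline): standard inverted Dropout, and one illustrative outlier metric, the ratio of the largest absolute activation to the RMS activation. The metric choice, the toy heavy-tailed activations, and all names here are assumptions for illustration only.

```python
import numpy as np

def inverted_dropout(x, p, rng):
    # Standard inverted Dropout: zero each unit with probability p,
    # rescale survivors by 1/(1-p) so the expectation is preserved.
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def outlier_ratio(x):
    # Illustrative outlier metric: largest |activation| over RMS activation.
    # A value near ~4 is typical for Gaussian data; much larger values
    # indicate heavy-tailed "outlier" activations.
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(0)
# Toy activations: mostly Gaussian, with a few injected extreme values
# mimicking the outlier features observed in Transformers.
acts = rng.standard_normal(10_000)
acts[:10] *= 50.0

print("outlier ratio, raw:        ", outlier_ratio(acts))
print("outlier ratio, after drop: ", outlier_ratio(inverted_dropout(acts, 0.1, rng)))
```

In this toy setting Dropout only removes an extreme value when its unit happens to be sampled out, which is one intuition for why it may reduce outliers on average without suppressing them completely.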
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 106