Unpacking Softmax: How Logits Norm Drives Representation Collapse, Compression and Generalization

16 Sept 2025 (modified: 20 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: representation learning, transfer learning, neural collapse, optimization, softmax
Abstract: The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping a model's representations. We introduce the concept of rank deficit bias, a phenomenon that challenges the full-rank emergence predicted by Neural Collapse: models converge to solutions whose rank is much lower than the number of classes. This bias depends on the norm of the logits entering the softmax, which is implicitly influenced by training hyperparameters or directly controlled by the softmax temperature. We show how to exploit the rank deficit bias to learn compressed representations or to enhance performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning for improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.
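The sketch below (not from the paper) illustrates the two quantities the abstract refers to: a temperature-scaled softmax, where the temperature rescales the logits norm before normalization, and an effective-rank measure of learned features, a common proxy for how compressed or collapsed a representation is. Function names and the entropy-based effective-rank definition are illustrative assumptions, not the authors' implementation.

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Dividing by tau rescales the logits norm before the softmax:
    # tau > 1 shrinks the effective norm (softer distribution),
    # tau < 1 inflates it (sharper distribution).
    return torch.softmax(logits / tau, dim=-1)

def effective_rank(features: torch.Tensor, eps: float = 1e-12) -> float:
    # Effective rank as the exponential of the entropy of the
    # normalized singular-value spectrum of the feature matrix
    # (rows = samples, columns = feature dimensions).
    s = torch.linalg.svdvals(features)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

# Example: random features of 512 samples in a 64-dimensional space.
feats = torch.randn(512, 64)
print(effective_rank(feats))            # close to 64 for random features
print(softmax_with_temperature(torch.randn(4, 10), tau=2.0).sum(dim=-1))  # rows sum to 1
```

Under the paper's claim, tracking such an effective-rank measure while varying the temperature would reveal whether the learned representation's rank drops below the number of classes.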
Supplementary Material: pdf
Primary Area: learning theory
Submission Number: 7454