Classification vs. Deep Feature Learning in Normalized Spaces with Different Scaling

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Classification, Deep Feature Learning, Scaling Factor, Neural Collapse
TL;DR: We compare classification and deep feature learning in depth by analyzing the minima of the CE and BCE losses in normalized space.
Abstract: In supervised scenarios, deep feature learning is typically implemented through the training of classification models. However, it should be noted that classification reflects the sample-wise local properties of models on a dataset, while deep feature learning aims to learn features with good sample-independent global properties, such as intra-class compactness and inter-class separability, on the dataset. This paper conducts an in-depth comparison of classification and deep feature learning in normalized spaces. We first reformulate the binary cross-entropy (BCE) loss to align with the fundamental requirements of feature learning; then, we theoretically analyze and compare its minima with those of the cross-entropy (CE) loss used for classification tasks. Informed by the above analysis, we explore the convergence behavior of the two losses as the scale factor $\gamma$ changes, revealing the differences between classification and deep feature learning. Specifically, when $\gamma$ increases linearly, the convergence rates of the two losses decay exponentially, resulting in poor feature properties for the trained models, although classification is unaffected. As $\gamma$ decreases, the losses reach their minima more readily, which helps to improve the feature properties. However, if $\gamma > 0$ decreases linearly and approaches zero, the convergence rates of the losses decay linearly, leading to unsatisfactory feature properties and making the models' classification highly sensitive to minor perturbations. Our experiments fully validate these conclusions. The experimental results also demonstrate the advantages of BCE over CE in more challenging scenarios such as long-tailed recognition and open-set recognition.
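To make the setting concrete, the following is a minimal NumPy sketch of the two losses discussed in the abstract, computed on $\ell_2$-normalized features and classifier weights with a scale factor $\gamma$ applied to the cosine logits. This is an illustrative assumption about the setup, not the paper's exact formulation (in particular, the paper's reformulated BCE may differ from the plain one-vs-all BCE shown here).

```python
import numpy as np

def normalize(v):
    # L2-normalize rows so features/weights lie on the unit hypersphere
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def ce_loss(features, weights, labels, gamma):
    # CE over gamma-scaled cosine-similarity logits (illustrative form)
    logits = gamma * normalize(features) @ normalize(weights).T
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def bce_loss(features, weights, labels, gamma):
    # One-vs-all BCE over the same gamma-scaled cosine logits
    logits = gamma * normalize(features) @ normalize(weights).T
    targets = np.zeros_like(logits)
    targets[np.arange(len(labels)), labels] = 1.0
    # stable form of -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))]
    return (np.logaddexp(0.0, logits) - targets * logits).mean()
```

Because both losses act on bounded cosine logits in $[-\gamma, \gamma]$, sweeping `gamma` in this sketch is an easy way to observe the convergence behavior the abstract describes.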
Primary Area: optimization
Submission Number: 1265