From Minor Adjustment to Major Gains: Soft Logit Normalization Loss Enhances Representations and Generalization

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: cross-entropy loss, soft logit normalization loss, generalization improvement, ImageNet-1K, BERT
Abstract: Developing novel loss functions that allow small models to attain performance parity with their larger counterparts is an active research area in artificial intelligence. We propose the Soft Logit Normalization (SLN) loss, which normalizes the logit vector by its powered L2-norm before applying the standard softmax function. Compared with the classical cross-entropy loss, SLN loss significantly improves generalization across multiple vision benchmarks, including CIFAR-10 and ImageNet-1K, enabling small models to match the performance of models with approximately three times more parameters—an improvement comparable to that achieved by advanced knowledge distillation techniques. Beyond vision tasks, experiments on language tasks with large transformer-based models (e.g., BERT$_{LARGE}$ with 340M parameters) demonstrate the versatility of SLN loss across modalities. Theoretical analysis further shows that SLN loss facilitates more separable penultimate-layer representations, which contributes to better generalization, as numerically validated on diverse datasets. This work not only advances the practical deployment of efficient models on resource-constrained devices but also opens new directions for research into loss function design.
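The abstract describes SLN as dividing the logit vector by its L2-norm raised to a power before the usual softmax cross-entropy. The following is a minimal NumPy sketch of one plausible reading of that description; the exponent `p` and the stabilizing `eps` are assumed hyperparameters, not values taken from the paper.

```python
import numpy as np

def sln_loss(logits, targets, p=1.0, eps=1e-8):
    """Hypothetical Soft Logit Normalization loss: scale each logit
    vector by 1 / ||z||_2^p, then apply standard softmax cross-entropy.
    p and eps are illustrative choices, not the paper's settings."""
    # L2-norm of each logit vector, clipped away from zero for safety.
    norms = np.maximum(np.linalg.norm(logits, axis=-1, keepdims=True), eps)
    z = logits / norms ** p
    # Numerically stable log-softmax on the normalized logits.
    z = z - z.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the target classes.
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Note that with `p = 0` the norm factor is 1 and the sketch reduces to plain cross-entropy, so the classical loss is recovered as a special case of this formulation.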
Primary Area: optimization
Submission Number: 23779