Keywords: Neural Networks, LLM, Optimization, Theory
TL;DR: A fundamental and systematic blind spot in CCE: erroneous probability mass can reside disproportionately long in the top-rank prediction for high-entropy states.
Abstract: Categorical cross-entropy (CCE) on softmax outputs has been the cornerstone of modern machine learning systems and large language models.
Despite its central role, we uncover a systematic blind spot in which erroneous probability mass can reside disproportionately long in the top-rank prediction for high-entropy states. That is, as entropy increases, the pressure to demote invalid top-rank predictions diminishes.
We prove that this undesirable bias is a fundamental property of learning distributions via CCE, and proceed to empirically demonstrate its existence across various settings.
Using controllable synthetic settings, we explicitly track this inefficiency and find that introducing neural networks tends to further exacerbate the issue. This holds for both dense neural networks and autoregressive Transformers trained with CCE for next-token prediction. Moreover, we find no indication that this disproportionately slow learning for high-entropy states disappears as we scale the number of model parameters.
Simply up-weighting the loss to counteract this slow learning in high-entropy states does not yield any perceptible improvement.
However, introducing consistency for high-entropy states can significantly accelerate the learning of good top ranks.
Finally, we investigate the recently discovered hyperfitting phenomenon and find that its counterintuitive results can be understood from a similar principle: it provides a training environment with extreme consistency, allowing the model to circumvent the CCE blind spot.
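As a minimal sketch (not the paper's formal argument), recall that the gradient of cross-entropy over softmax outputs with respect to the logits is softmax(logits) − target. The toy example below, with an assumed 5-token vocabulary and hand-picked logits, illustrates how the corrective pull on any single valid token is diluted as target entropy grows, while the downward push on an invalid top-ranked token stays fixed:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cce_grad(logits, target):
    """Gradient of categorical cross-entropy w.r.t. the logits:
    softmax(logits) - target."""
    p = softmax(logits)
    return [pi - ti for pi, ti in zip(p, target)]

# Toy vocabulary of 5 tokens; the model erroneously ranks token 0 on top.
logits = [2.0, 0.0, 0.0, 0.0, 0.0]

# Low-entropy target: all mass on token 1 (token 0 is invalid).
g_low = cce_grad(logits, [0.0, 1.0, 0.0, 0.0, 0.0])
# High-entropy target: mass spread over tokens 1-4 (token 0 still invalid).
g_high = cce_grad(logits, [0.0, 0.25, 0.25, 0.25, 0.25])

# The downward push on the invalid top-rank logit is the same in both cases...
print(round(g_low[0], 3), round(g_high[0], 3))   # ≈ 0.649 0.649
# ...but the upward pull on each individual valid token shrinks with entropy,
# so far more gradient steps pass before a valid token overtakes the top rank.
print(round(g_low[1], 3), round(g_high[1], 3))   # ≈ -0.912 -0.162
```

This only illustrates the gradient structure behind the claimed bias; the paper's proof and empirical tracking of the effect are, of course, more involved.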
Primary Area: learning theory
Submission Number: 11723