TL;DR: We compare CE and BCE in deep feature learning and find that BCE performs better than CE at enhancing feature properties.
Abstract: When training classification models, one expects the learned features to be compact within classes and well separated across classes. As the dominant loss function for training classification models, cross-entropy (CE) loss maximizes this compactness and distinctiveness at its minimum, i.e., it reaches neural collapse (NC). Recent works show that binary CE (BCE) also performs well in multi-class tasks. In this paper, we compare BCE and CE in deep feature learning. For the first time, we prove that BCE can also maximize the intra-class compactness and inter-class distinctiveness when reaching its minimum, i.e., it leads to NC. We point out that CE measures the relative values of decision scores during model training, implicitly enhancing the feature properties by classifying samples one by one. In contrast, BCE measures the absolute values of decision scores and adjusts the positive/negative decision scores across all samples to uniformly high/low levels. Meanwhile, the classifier biases in BCE impose a substantial constraint on the decision scores, explicitly enhancing the feature properties during training. The experimental results align with the above analysis and show that BCE improves classification and leads to better compactness and distinctiveness among sample features. The code has been released.
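To make the relative-vs-absolute contrast above concrete, here is a minimal sketch (assuming PyTorch; the batch size, class count, and random inputs are illustrative, not the paper's setup) showing that CE depends only on the relative values of the decision scores, while BCE penalizes their absolute values:

import torch
import torch.nn.functional as F

num_classes = 10
logits = torch.randn(32, num_classes)          # decision scores for a batch of 32 samples
labels = torch.randint(0, num_classes, (32,))  # integer class labels

# CE applies a softmax over classes, so only the *relative* values of the
# decision scores matter: shifting every logit by a constant changes nothing.
ce_loss = F.cross_entropy(logits, labels)
assert torch.allclose(F.cross_entropy(logits + 5.0, labels), ce_loss)

# BCE treats each class as an independent binary problem against a one-hot
# target, so the *absolute* value of every decision score is penalized:
# positive scores are pushed to uniformly high levels, negative scores to
# uniformly low levels, and a constant shift does change the loss.
one_hot = F.one_hot(labels, num_classes).float()
bce_loss = F.binary_cross_entropy_with_logits(logits, one_hot)

print(f"CE: {ce_loss.item():.4f}  BCE: {bce_loss.item():.4f}")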
Lay Summary: This paper explores two loss functions for training classification models in machine learning: Binary Cross-Entropy (BCE) and Cross-Entropy (CE). It highlights how BCE can enhance the quality of learned features, leading to better classification.
Key Findings:
Feature Learning: BCE promotes compactness within classes and distinctiveness between classes, improving model performance.
Bias Impact: Unlike in CE, the classifier biases in BCE impose an explicit constraint on decision scores that enhances feature properties.
Empirical Results: Experiments show that models trained with BCE outperform those trained with CE in classification accuracy and feature properties.
Overall, this research suggests that BCE is often a more effective loss function for training classification models.
Primary Area: Deep Learning->Theory
Keywords: cross-entropy loss, BCE, neural collapse, decision score, weight decay
Submission Number: 8765