Keywords: Mechanistic Interpretability, Model Optimization, Class Imbalance
TL;DR: Beyond element-wise gradient balancing, Adam can balance layer-level gradients across iterations to mitigate class imbalance.
Abstract: Adam has remained a dominant optimization algorithm in deep learning for a decade. Recent studies reveal that Adam mitigates class imbalance by normalizing element-wise gradients to balance gradients across classes. However, this interpretation relies on the assumption that gradients of different classes are fully orthogonal. In this paper, we investigate this assumption further. We observe that inter-class gradient orthogonality can be low, particularly during the early stages of training, yet Adam still mitigates class imbalance. This suggests that Adam may not reduce class imbalance through element-wise gradient normalization. Through ablations of Adam, we further show that class imbalance can be alleviated without element-wise gradient normalization. This work reveals that, even when inter-class gradients are coupled, Adam mitigates class imbalance by normalizing gradients across iterations. During early training, the model primarily fits high-frequency class data; as the loss on these classes diminishes, it adapts to low-frequency classes. Because of this inter-iteration normalization, the gradient magnitudes for low-frequency classes then approximate the initial high-frequency gradients. This mechanism helps Adam mitigate class imbalance. Consequently, we demonstrate that this mechanism requires at least layer-wise gradient normalization across iterations, since most neural networks exhibit layer-level inconsistencies between forward and backward propagation. Finally, we explore potential limitations in Adam's ability to address these inconsistencies.
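For illustration, here is a minimal sketch (not the paper's code; the function names and the specific layer-wise rule are our assumptions) contrasting standard Adam's element-wise second-moment normalization with an ablated variant that keeps a single scalar second moment per layer. The layer-wise variant cannot balance gradients between classes element-by-element, but it still rescales each layer's gradient magnitude across iterations, which is the mechanism the abstract points to.

```python
# Sketch only: standard Adam vs. a hypothetical layer-wise ablation.
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam: each element is normalized by its own second-moment
    estimate, so gradients can be balanced element-wise across classes."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2          # per-element second moment
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

def layerwise_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Ablated variant (hypothetical): one scalar second moment per layer.
    Normalization acts on the layer's overall gradient magnitude across
    iterations rather than on individual elements."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * np.mean(grad**2)  # scalar per layer
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

Under this ablation, once high-frequency-class losses shrink, the scalar `v` shrinks with them, so the later, smaller gradients from low-frequency classes are rescaled toward the magnitude of the earlier high-frequency gradients.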
Primary Area: interpretability and explainable AI
Submission Number: 10105