Probability-dependent gradient decay in large margin softmax

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Gradient decay; Large margin Softmax; Local Lipschitz constraint; Curriculum learning; Model calibration
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In this paper, a gradient decay hyperparameter is introduced into Softmax to control the probability-dependent gradient decay rate. Through theoretical analysis and empirical results, we find that generalization and calibration depend significantly on how fast the gradient decays as the confidence probability rises, i.e., whether the gradient decreases convexly or concavely as the sample probability increases. Moreover, optimization with a small gradient decay follows a curriculum-learning-like sequence in which hard samples receive attention only after easy samples have been fitted with sufficient confidence, and well-separated samples receive a larger gradient that reduces intra-class distance. Unfortunately, a small gradient decay exacerbates model overconfidence, shedding light on the causes of the poor calibration observed in modern neural networks. Conversely, a large gradient decay significantly mitigates these issues, outperforming even models equipped with post-hoc calibration methods. Based on this analysis, we provide evidence that large margin Softmax affects the local Lipschitz constraint by regulating the probability-dependent gradient decay rate. This paper offers a new perspective on the relationship among large margin Softmax, curriculum learning, and model calibration by analyzing the gradient decay rate. In addition, we propose a warm-up strategy to dynamically adjust the gradient decay during training.
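The abstract does not spell out the exact loss, but the mechanism it describes can be illustrated with a minimal sketch: a cross-entropy variant whose per-sample gradient with respect to the target logit decays as (1 - p)^beta instead of the standard (1 - p), where `beta` stands in for the gradient decay hyperparameter. The function name `gradient_decay_ce` and the detached-reweighting form below are assumptions made for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch, not the paper's method: a cross-entropy variant whose
# gradient w.r.t. the target logit has magnitude (1 - p)^beta, so `beta`
# controls how quickly the gradient decays as the confidence p rises.
import torch
import torch.nn.functional as F

def gradient_decay_ce(logits, targets, beta=2.0):
    """Cross-entropy reweighted so the target-logit gradient decays as (1 - p)^beta.

    beta > 1: gradient shrinks quickly once a sample becomes confident (convex decay).
    beta < 1: gradient decays slowly, keeping pressure on well-separated samples.
    beta = 1: recovers standard cross-entropy.
    """
    log_probs = F.log_softmax(logits, dim=1)
    p = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)   # target-class probability
    ce = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)       # standard per-sample CE
    # Detach the factor so it only rescales the backward pass: the usual
    # (1 - p) gradient becomes (1 - p)^(beta - 1) * (1 - p) = (1 - p)^beta.
    weight = (1.0 - p).clamp_min(1e-12).pow(beta - 1.0).detach()
    return (weight * ce).mean()

# Usage with logits from any classifier head
logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
loss = gradient_decay_ce(logits, targets, beta=2.0)
loss.backward()
```

Under this sketch, the warm-up strategy mentioned in the abstract could be realized by scheduling `beta` over training, e.g., annealing it from a small value (slow decay, focus on separating easy samples) toward a larger one (fast decay, better-calibrated confidences); this schedule is likewise an assumption, as the abstract does not specify it.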
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1676