Long-tailed Learning with Muon Optimizer

ICLR 2026 Conference Submission10542 Authors

18 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: machine learning, imbalanced learning
Abstract: Long-tailed recognition poses a significant challenge in deep learning, as models tend to be biased towards head classes, leading to poor generalization on underrepresented tail classes. A key factor contributing to this issue is that the optimization process for tail classes often stalls in sharp regions of the loss landscape. In this work, we investigate the problem from an optimization perspective and leverage the recently proposed Muon optimizer. We provide new theoretical insights, demonstrating that Muon's gradient orthogonalization enhances the update's projection along directions of negative curvature, thereby facilitating a more effective escape from sharp minima. To mitigate Muon's additional computational overhead, we propose the Progressive Muon Optimizer (ProMO), a novel hybrid optimization approach that balances performance with efficiency. Specifically, ProMO employs a sinusoidal probability schedule to dynamically alternate between SGD and Muon. This method predominantly uses computationally efficient SGD in the early stages of training and gradually increases the use of Muon as the model approaches convergence, when escaping sharp minima becomes critical for tail-class generalization. Extensive experiments on large-scale long-tailed benchmarks demonstrate that ProMO consistently outperforms existing long-tailed recognition methods. These results validate that ProMO effectively improves generalization on tail classes without incurring significant computational costs, highlighting its potential as a practical and effective solution for long-tailed learning.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10542
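To make the abstract's description of the hybrid schedule concrete, below is a minimal sketch of how a sinusoidal probability schedule could alternate between SGD and Muon over training. This is not the authors' implementation; the function names (`muon_probability`, `choose_optimizer`), the exact sine ramp, and the per-step coin flip are illustrative assumptions based only on the abstract's description.

```python
import math
import random

def muon_probability(step: int, total_steps: int) -> float:
    """Hypothetical sinusoidal schedule: the probability of taking a Muon
    step rises from 0 at the start of training to 1 near convergence."""
    progress = min(step / max(total_steps, 1), 1.0)
    return math.sin(0.5 * math.pi * progress)

def choose_optimizer(step: int, total_steps: int) -> str:
    """Stochastically select the optimizer for this step: cheap SGD early
    in training, Muon (orthogonalized updates) increasingly often later."""
    p = muon_probability(step, total_steps)
    return "muon" if random.random() < p else "sgd"

# Example: early steps are almost always SGD, late steps almost always Muon.
if __name__ == "__main__":
    total = 10000
    for step in (0, 2500, 5000, 7500, 9999):
        print(step, round(muon_probability(step, total), 3), choose_optimizer(step, total))
```

Under this sketch, the expensive orthogonalized updates are concentrated near the end of training, which matches the abstract's claim that escaping sharp minima matters most for tail-class generalization as the model approaches convergence.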