Keywords: Optimization, Lion, Deep Learning
Abstract: Lion is a novel optimization method that has outperformed traditional optimizers such as Adam across a variety of tasks. Despite its empirical success, the reasons behind Lion's superiority remain unclear. In this paper, we investigate the mechanisms contributing to Lion's enhanced performance, focusing on the structured noise introduced by the use of the sign function in gradient updates. We characterize this noise by the angle of rotation between a vector and its signum, inject it as a random fixed-angle rotation into normalized updates, and compare the performance of this method to that of Lion. We show that this method outperforms Lion in our setting. This approach reveals a relationship between the learning rate and the noise specific to the Lion method, providing insight into its improved performance. Additionally, we identify an effect we term "momentum tracing" in neural networks with normalization layers and ReLU activations, which can significantly destabilize training. Our analysis demonstrates that the rotation noise inherent in Lion mitigates the negative impact of "momentum tracing", leading to more stable learning. These findings offer theoretical justification for Lion's effectiveness and suggest avenues for developing more robust optimization algorithms.
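The abstract describes two ingredients: the angle between an update vector and its signum, and a random fixed-angle rotation applied to a normalized update. The sketch below is a minimal NumPy illustration of both, not the authors' code; the function names, the choice of a uniformly random orthogonal rotation direction, and the use of the momentum vector as the update direction are assumptions for illustration only.

```python
import numpy as np

def sign_angle(v):
    """Angle (radians) between v and sign(v).

    For v with no zero entries, cos(theta) = <v, sign(v)> / (||v||_2 * ||sign(v)||_2)
                                           = ||v||_1 / (||v||_2 * sqrt(d)).
    """
    v = np.asarray(v, dtype=float)
    cos = np.abs(v).sum() / (np.linalg.norm(v) * np.sqrt(v.size))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def rotate_by_fixed_angle(u, theta, rng=None):
    """Rotate the direction of u by a fixed angle theta toward a random orthogonal direction."""
    rng = np.random.default_rng() if rng is None else rng
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)            # normalized update direction
    r = rng.standard_normal(u.shape)
    w = r - (r @ u) * u                  # remove the component along u
    w = w / np.linalg.norm(w)            # random unit vector orthogonal to u
    return np.cos(theta) * u + np.sin(theta) * w

# Illustration: replace the sign nonlinearity of a Lion-style update with
# normalization plus rotation noise of the same characteristic angle.
m = np.random.randn(1000)                # stand-in for the momentum / update direction
theta = sign_angle(m)                    # angle that sign(m) makes with m
noisy_update = rotate_by_fixed_angle(m, theta)   # unit-norm update with fixed-angle noise
```

Under these assumptions, the rotated update has the same angular deviation from the raw direction as sign(m) does, while the direction of the deviation is random rather than determined coordinate-wise by the sign function.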
Submission Number: 128