To Use or Not to Use Muon: How Simplicity Bias in Optimizers Matters

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: Optimizers, Muon, SGD, Simplicity Bias
Abstract: Among recently introduced optimizers, Muon has perhaps gained the most popularity due to its superior training speed. While many papers focus on the benefits of Muon, ours asks whether this speedup carries any downsides. We study the biases induced when optimizing with Muon, providing a theoretical analysis and its consequences for learning trajectories and learned solutions. While the theory does explain the benefits Muon brings, it also guides our intuition in constructing a couple of examples where Muon-optimized models are disadvantaged because they lose a simplicity bias. More broadly, this paper should serve as a reminder: when developing new optimizers, it is essential to consider the biases they introduce, as these biases can fundamentally change a model's behavior, for better or for worse.
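For readers unfamiliar with the optimizer under discussion, the sketch below illustrates the core of a Muon-style update: gradients for a weight matrix are accumulated into a momentum buffer, which is then approximately orthogonalized via a Newton-Schulz iteration before being applied. This is a minimal NumPy sketch, not the paper's implementation; the quintic coefficients and hyperparameters are those commonly cited for Muon's reference implementation and are assumptions here.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately map G to the nearest (semi-)orthogonal matrix.

    Quintic Newton-Schulz iteration; the coefficients (a, b, c) are
    those commonly cited for Muon's reference implementation
    (an assumption in this sketch, not taken from the paper above).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=0.02, beta=0.95):
    """One hypothetical Muon-style update for a single weight matrix."""
    M = beta * M + G                    # momentum accumulation
    W = W - lr * newton_schulz_orthogonalize(M)
    return W, M
```

Because the orthogonalized update has roughly uniform singular values, every direction of the momentum matrix is stepped at a similar rate; this is exactly the kind of spectral behavior that the abstract's discussion of simplicity bias concerns.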
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 69