AMUSE: Anytime Muon with Stable Gradient Evaluation

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: Optimization, Muon, Schedule-Free, Loss landscape, River-valley
Abstract: Modern deep learning commonly relies on AdamW with learning rate schedules. Schedule-Free optimization and Muon challenge this standard recipe from complementary directions: the former removes the need for explicit schedules, while the latter offers an alternative to AdamW. Despite Muon's strong empirical performance, the mechanism behind its improvement remains only partially understood. We study this question through the river-valley perspective by examining how Muon updates decompose into river and valley directions. We show that Muon's orthogonalization increases the river component of its updates, which accelerates progress, but can also leave residual valley components that produce oscillatory trajectories. Building on this, we propose **Anytime MUon with Stable gradient Evaluation (AMUSE)**, which combines Muon's rapid river progress with the stabilizing effect of Schedule-Free optimization. AMUSE initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts the evaluation point toward the stable averaged sequence to reduce valley-wall oscillations. As a result, AMUSE requires no learning rate schedule and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.
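The mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a Schedule-Free-style pair of sequences (a fast sequence `z` and a running average `x`), uses an exact SVD in place of the Newton-Schulz iteration that Muon implementations typically use for orthogonalization, and uses a hypothetical interpolation schedule `beta = t / (t + shift_steps)` to move the gradient evaluation point from the fast sequence toward the averaged one. The names `amuse_like_sketch`, `shift_steps`, and all default values are illustrative assumptions.

```python
import numpy as np


def orthogonalize(g):
    """Return U @ Vt from the thin SVD of g (all singular values set to 1).

    Stand-in for Muon's orthogonalization step; an exact SVD is used here
    for clarity rather than an approximate Newton-Schulz iteration.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def amuse_like_sketch(grad_fn, w0, steps=200, lr=0.05, momentum=0.9,
                      shift_steps=50):
    """Illustrative two-sequence loop (assumed form, not the paper's method).

    z: fast sequence updated with orthogonalized momentum (Muon-like step).
    x: uniform running average of z (Schedule-Free-style averaged sequence).
    y: gradient evaluation point, interpolated between z and x.
    """
    z = w0.copy()
    x = w0.copy()
    m = np.zeros_like(w0)
    for t in range(1, steps + 1):
        # Hypothetical schedule: beta grows from near 0 toward 1, so y
        # starts close to the fast sequence z and drifts toward x.
        beta = t / (t + shift_steps)
        y = (1.0 - beta) * z + beta * x
        g = grad_fn(y)
        m = momentum * m + g               # momentum buffer
        z = z - lr * orthogonalize(m)      # constant step size, no schedule
        x = x + (z - x) / (t + 1)          # uniform average of z iterates
    return x, z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal((4, 3))
    # Toy quadratic loss 0.5 * ||W - target||_F^2, whose gradient is W - target.
    x, z = amuse_like_sketch(lambda w: w - target, np.zeros((4, 3)))
    print("distance of averaged iterate to target:", np.linalg.norm(x - target))
```

In this sketch the averaged iterate `x` can be read out at any step, which is the sense in which a Schedule-Free-style method supports anytime training; how the evaluation point is actually shifted in AMUSE is specified in the paper, not here.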
Submission Number: 150