On the Surprising Effectiveness of Masking Updates in LLM Training

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: LLM optimization, loss landscape geometry, curvature-dependent regularization
Abstract: Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant consistently outperforming the corresponding dense optimizer. Our analysis shows that random masking induces a curvature-dependent regularization term that penalizes sharp update directions. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Across extensive LLM pre-training experiments up to 1B parameters, Magma improves a broad range of optimizers—including Adam, Lion, Muon, and SOAP—in 14 out of 16 cases for dense architectures, and successfully improves both Adam and Muon in sparse mixture-of-experts (MoE) architectures. Our analysis shows that alignment-based masking can widen the stable optimization regime by damping noisy, high-curvature blocks while preserving descent direction.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 144