Rank Allocation in Low-Rank Optimizers

Published: 29 May 2026, Last Modified: 08 Jun 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: optimization, low-rank
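TL;DR: Low-rank optimizers (Muon, Dion, GaLore) split rank across attention/FFN by a hardcoded constant. We prove a conditional matched-budget lower bound (central inequality machine-checked in Lean 4) and show that the standard (1,4,1) split costs 0.015 nats of validation loss relative to a uniform split on a 166M LLaMA-style model. The published defaults are not optimal for this configuration.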
Abstract: Low-rank spectral optimizers such as Muon, Dion, PowerSGD, and GaLore expose a per-layer rank budget, yet the split of that budget across attention and feed-forward matrices is usually fixed by a simple rule rather than measured. We formalize this design choice as a rank-profile allocation problem, distinguish the Ky-Fan capture geometry relevant to orthogonalized descent from the Frobenius geometry relevant to low-rank compression, and prove a conditional matched-budget lower bound in terms of measurable per-layer spectral margins, with the central inequality machine-checked in Lean 4. In a $166$M-parameter LLaMA-style decoder trained with Dion at $d_{\text{head}}=128$, three paired seeds consistently separate uniform, structural, and rank-inverted profiles. At this operating point the uniform profile attains the lowest validation loss, outperforming the published $(1,4,1)$ structural rule by $0.015$ nats $[0.009,\,0.026]$ and the lower-budget rank-inverted stress test by $0.033$ nats $[0.027,\,0.041]$, with all paired-seed differences of the same sign. Rank allocation is therefore a measurable and consequential architecture-level design axis, while the published structural constants are not the optimal operating point for this configuration.
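The distinction the abstract draws between Ky-Fan and Frobenius geometry can be illustrated with a minimal NumPy sketch. It assumes the usual textbook definitions: the Ky-Fan capture of a rank-$r$ truncation is the share of the sum of singular values held by the top $r$, and the Frobenius capture is the share of the sum of squared singular values. The function name `capture_fractions` and the toy matrix are hypothetical, and the paper's exact definitions may differ.

```python
import numpy as np


def capture_fractions(matrix, rank):
    """Return (Ky-Fan capture, Frobenius capture) of a rank-`rank` truncation.

    Assumed definitions (not taken from the paper):
      Ky-Fan capture    = sum of top-r singular values / sum of all singular values
      Frobenius capture = sum of top-r squared singular values / sum of all squared
    """
    # Singular values are returned in descending order.
    s = np.linalg.svd(matrix, compute_uv=False)
    ky_fan = s[:rank].sum() / s.sum()
    frobenius = (s[:rank] ** 2).sum() / (s ** 2).sum()
    return ky_fan, frobenius


# Toy "gradient" with known singular values 4, 2, 1, 1 (illustrative only).
gradient = np.diag([4.0, 2.0, 1.0, 1.0])
ky_fan, frobenius = capture_fractions(gradient, rank=1)
print(f"Ky-Fan capture:    {ky_fan:.3f}")
print(f"Frobenius capture: {frobenius:.3f}")
```

On this toy spectrum the same rank captures half of the Ky-Fan mass but roughly 73% of the Frobenius energy, which is one way to see why a rank split tuned for compression need not suit orthogonalized descent.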
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 207