Two-Sided Low-Rank SOAP for Efficient LLM Training

08 May 2026 (modified: 09 May 2026) · ICML 2026 Workshop CoLoRAI Submission · CC BY 4.0
Keywords: low-rank optimization, SOAP, adaptive optimizers, language model training
TL;DR: ALSO is a two-sided low-rank approximation to SOAP that improves over Alice while retaining substantially lower optimizer-state memory than full SOAP.
Abstract: Matrix-aware adaptive optimizers such as Shampoo and SOAP exploit row and column structure in neural-network weights, but their optimizer states are expensive for large language models. Recent low-rank optimizers reduce memory by keeping adaptive statistics in a small subspace, yet most existing variants are one-sided: they compress only one matrix axis and therefore approximate only a restricted part of SOAP's two-sided geometry. We propose the **A**daptive **L**ow-Dimensional Subspace **SO**AP Method (**ALSO**), a compact two-sided extension of Alice, an optimizer that estimates adaptive low-dimensional gradient subspaces and compensates for the discarded directions. For a matrix parameter, ALSO maintains low-dimensional row and column subspaces, applies Adam-like scaling in the resulting core coordinates, and compensates for the three residual blocks outside the core. The method can be viewed as a two-sided low-rank approximation to SOAP that keeps full-parameter updates while reducing optimizer-state cost. We evaluate ALSO on C4 pre-training with LLaMA-style models.
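
To make the update structure described in the abstract concrete, here is a minimal PyTorch sketch of a two-sided core projection with Adam-like scaling and residual compensation. It is based only on the abstract: the function name, hyperparameters, the scalar compensation rule, and the assumption that orthonormal bases `U` and `V` are maintained elsewhere (e.g., refreshed periodically from gradient statistics) are illustrative assumptions, not the paper's actual ALSO algorithm.

```python
# Hedged sketch of a two-sided low-rank SOAP-like step, reconstructed from the
# abstract's description. All names and the compensation rule are assumptions.
import torch

def two_sided_core_step(G, U, V, m_core, v_core,
                        lr=3e-4, betas=(0.9, 0.999), eps=1e-8):
    """One update for an (m x n) matrix parameter with gradient G.

    U: (m, r) orthonormal column-subspace basis (assumed maintained elsewhere)
    V: (n, r) orthonormal row-subspace basis
    m_core, v_core: (r, r) Adam first/second-moment state in core coordinates.
    Returns the full-parameter update and the new moment states.
    """
    b1, b2 = betas
    # Project the gradient into the low-dimensional "core" coordinates.
    C = U.T @ G @ V                                  # (r, r) core gradient
    # Adam-like scaling inside the core.
    m_core = b1 * m_core + (1 - b1) * C
    v_core = b2 * v_core + (1 - b2) * C * C
    core_update = m_core / (v_core.sqrt() + eps)
    # Residual: the three blocks of G outside the core,
    #   (I - UU^T) G VV^T,  UU^T G (I - VV^T),  (I - UU^T) G (I - VV^T).
    residual = G - U @ C @ V.T
    # Illustrative compensation: pass the residual through with a scalar scale
    # so discarded directions still move (the paper's rule may differ).
    comp_scale = core_update.norm() / (C.norm() + eps)
    update = U @ core_update @ V.T + comp_scale * residual
    return -lr * update, m_core, v_core
```

Note that the optimizer state here is two (r, r) core moments plus the two bases, which is the source of the memory saving over full SOAP's (m, m) and (n, n) preconditioners, while the returned update remains full-rank because the residual blocks are compensated rather than dropped.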
Submission Number: 130