Keywords: optimization, deep learning, reparameterization, robustness
TL;DR: We show that a mathematically equivalent reparameterization of Shampoo- and SOAP-style optimizers is more robust to lower numerical precision and permits efficient subspace updates of their preconditioners.
Abstract: Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks, and they rely on QR decompositions. Because existing QR implementations require single-precision arithmetic and remain computationally expensive, these methods become time- and memory-intensive when their preconditioning matrices are large. Moreover, storing the preconditioners in half precision (BF16) to reduce memory can degrade the performance of Shampoo-based methods. We propose a reparameterization of the preconditioner that supports half-precision storage and enables efficient QR-based updates in subspaces while retaining single-precision arithmetic, thereby reducing both computational cost and memory overhead. It applies broadly to Shampoo-based methods that employ QR decomposition, including KL-Shampoo and SOAP. Our approach mitigates the performance degradation of these methods under half-precision storage and, overall, makes them more memory- and time-efficient.
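To make the half-precision storage issue concrete, the following is a minimal illustrative sketch and not the reparameterization proposed in the submission. It assumes the half-precision format is bfloat16, simulates bfloat16 rounding with the standard library, and shows that an orthogonal matrix (standing in for a stored eigenbasis) loses orthogonality when stored that way, while a QR-style re-orthogonalization in higher precision restores it. All names (`to_bf16`, `orth_error`, `gram_schmidt`) are hypothetical helpers, and Python floats are double precision, so they only stand in for the single-precision arithmetic mentioned above.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (simulated; NaN/inf not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that bfloat16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def orth_error(q):
    """Largest absolute entry of Q^T Q - I."""
    n = len(q)
    err = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(q[k][i] * q[k][j] for k in range(n))
            err = max(err, abs(dot - (1.0 if i == j else 0.0)))
    return err


def gram_schmidt(q):
    """Q factor of a QR decomposition via modified Gram-Schmidt on columns."""
    n = len(q)
    cols = [[q[k][j] for k in range(n)] for j in range(n)]
    basis = []
    for v in cols:
        for u in basis:
            proj = sum(a * b for a, b in zip(u, v))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    # Convert the list of orthonormal columns back to row-major layout.
    return [[basis[j][k] for j in range(n)] for k in range(n)]


# A 2x2 rotation is exactly orthogonal up to floating-point rounding.
theta = 0.3
c, s = math.cos(theta), math.sin(theta)
q = [[c, -s], [s, c]]

# Storing the entries in (simulated) bfloat16 perturbs the column norms.
q_bf16 = [[to_bf16(x) for x in row] for row in q]

# Re-orthogonalizing in higher precision repairs the stored factor.
q_fixed = gram_schmidt(q_bf16)

print("full precision :", orth_error(q))
print("bf16 storage   :", orth_error(q_bf16))
print("after QR repair:", orth_error(q_fixed))
```

The sketch only demonstrates why naive half-precision storage of an orthogonal factor can be harmful; how the proposed reparameterization avoids this is described in the submission itself.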
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 138