Keywords: Shampoo, SOAP, covariance estimation, Kullback–Leibler divergence, Gaussian, optimization
TL;DR: A new Kullback–Leibler perspective for interpreting and improving Shampoo and SOAP
Abstract: Shampoo and its efficient, Adam-stabilized variant SOAP employ structured second-moment estimation and have received growing attention for their effectiveness.
In practice, Shampoo requires step-size grafting with Adam to achieve competitive performance. SOAP mitigates this by applying Adam in Shampoo's eigenbasis while further reducing per-iteration runtime.
However, reliance on Adam introduces additional memory overhead in both methods.
Prior theoretical interpretations have primarily examined their estimation schemes using the Frobenius norm. Motivated by the natural correspondence between the second moment and a covariance matrix, we reinterpret the estimation procedures in Shampoo and SOAP as instances of covariance estimation through the lens of Kullback–Leibler (KL) divergence minimization. This perspective reveals a previously overlooked theoretical limitation and motivates principled improvements to their design.
Building on the KL perspective, we propose practical estimation schemes---KL-Shampoo and KL-SOAP---that
match or exceed the performance of Shampoo and SOAP for pre-training a range of neural network models while maintaining SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to achieve superior performance, thereby avoiding the associated memory overhead. Surprisingly, KL-Shampoo consistently outperforms the other methods in our experiments.
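For readers unfamiliar with the estimation scheme the abstract refers to, the following is a minimal NumPy sketch of Shampoo's Kronecker-factored second-moment estimation and preconditioned update for a matrix-shaped gradient. The function name, the EMA decay `beta2`, and the damping `eps` are illustrative choices, not the submission's implementation.

```python
import numpy as np

def shampoo_step(G, L, R, beta2=0.999, eps=1e-8):
    """One Shampoo-style step for a matrix gradient G of shape (m, n).

    L (m, m) and R (n, n) are the Kronecker factors of the structured
    second-moment estimate: E[vec(G) vec(G)^T] is approximated by a
    Kronecker product of R and L.
    """
    # Exponential moving averages of the one-sided second moments.
    L = beta2 * L + (1 - beta2) * (G @ G.T)   # left factor
    R = beta2 * R + (1 - beta2) * (G.T @ G)   # right factor

    def inv_root(M, p):
        # Inverse p-th root of a symmetric PSD matrix via eigendecomposition.
        w, Q = np.linalg.eigh(M)
        return Q @ np.diag(np.clip(w, eps, None) ** (-1.0 / p)) @ Q.T

    # Preconditioned update: L^{-1/4} G R^{-1/4}.
    update = inv_root(L, 4) @ G @ inv_root(R, 4)
    return update, L, R

# Usage: initialize the factors to identity and precondition a gradient.
rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))
L, R = np.eye(4), np.eye(3)
update, L, R = shampoo_step(G, L, R)
```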
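The KL reinterpretation described in the abstract can be made concrete with the standard closed form of the KL divergence between zero-mean Gaussians: treat the full second moment $S = \mathbb{E}[gg^\top]$ as a covariance matrix and a structured estimate $C$ as the covariance of an approximating Gaussian. One natural formalization, which the abstract does not spell out and is given here only as a sketch, is

```latex
\min_{C \in \mathcal{C}}
\mathrm{KL}\!\left(\mathcal{N}(0, S) \,\|\, \mathcal{N}(0, C)\right)
= \min_{C \in \mathcal{C}}
\frac{1}{2}\left(\operatorname{tr}\!\left(C^{-1} S\right)
- \log\det\!\left(C^{-1} S\right) - d\right),
```

where $\mathcal{C}$ is a structured family such as Kronecker products $A \otimes B$ and $d$ is the dimension. How KL-Shampoo and KL-SOAP instantiate this objective is specified in the paper itself, not in this abstract.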
Primary Area: optimization
Submission Number: 9555