Abstract: Sharpness-aware minimization (SAM) has been shown to improve the generalization of neural networks. However, the method comes at the cost of storing a perturbation of the model parameters, which can be restrictive in memory-bound settings. We design a variant of SAM, called $\nu$SAM, which obtains a low-rank perturbation by modifying the perturbation constraint. The update almost entirely removes the memory footprint of the perturbation without increasing the computational complexity, thus saving close to $1/3$ of the parameter-related memory when SGD is used as the base optimizer. We demonstrate that $\nu$SAM performs comparably to SAM on vision transformers, both when training models from scratch and when fine-tuning. Interestingly, $\nu$SAM appears to significantly improve performance for MLP-Mixer architectures in both settings. We corroborate these results theoretically by showing that SAM with an \emph{arbitrary} norm choice (which includes $\nu$SAM) can converge even with a fixed perturbation radius.
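As context for the memory overhead the abstract refers to, the following is a minimal PyTorch-style sketch of the standard two-step SAM update (in the spirit of Foret et al.), not the $\nu$SAM variant proposed in this submission; the function name `sam_step`, its arguments, and the default `rho` are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of the standard SAM two-step update, shown only to
# illustrate where the parameter-sized perturbation buffer appears.
# This is NOT the submission's nu-SAM method; names are illustrative.
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    inputs, targets = batch

    # First forward/backward pass: gradient at the current parameters w.
    loss_fn(model(inputs), targets).backward()

    # Build the ascent perturbation e = rho * g / ||g||_2 and store it.
    # This per-parameter buffer is the memory overhead that a low-rank
    # perturbation would (almost) remove.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                      # move to the perturbed point w + e
            perturbations.append((p, e))

    # Second forward/backward pass: gradient at the perturbed point.
    base_opt.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation, then apply the base optimizer (e.g. SGD)
    # using the gradient computed at w + e.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
```

In this sketch, `perturbations` holds one extra tensor per parameter; together with the weights themselves and SGD's momentum buffer, that amounts to roughly three parameter-sized arrays, which appears to be the accounting behind the "close to $1/3$" memory saving stated in the abstract.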
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Konstantin_Mishchenko1
Submission Number: 3459