Keywords: Lipschitz deep learning, transformers, optimization
TL;DR: We develop norm-constraint methods that enforce Lipschitz bounds for transformer models throughout training, scaling up to NanoGPT.
Abstract: Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods (weight decay and spectral normalization), allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our ≤2-Lipschitz transformer on Shakespeare text reaches 60% validation accuracy. Scaling to 140M parameters, our ≤10-Lipschitz transformer reaches 21% accuracy on internet text. When matching the NanoGPT baseline accuracy of 37.4%, our Lipschitz-bounded network achieves a maximum activation norm of 112, compared to about 1,872 for the unconstrained network. Our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.
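A minimal sketch of the kind of weight-norm maintenance the abstract describes: projecting each weight matrix back onto a spectral-norm ball after every optimizer step so that a per-layer Lipschitz bound holds throughout training. This assumes PyTorch; the function names (`spectral_norm_estimate`, `project_spectral_norm`) and the power-iteration projection are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch: keep each weight matrix's spectral norm below a cap after every
# optimizer step, so the layer's Lipschitz bound is enforced throughout training.
# Assumes PyTorch; names and the cap value are illustrative, not the paper's method.
import torch

@torch.no_grad()
def spectral_norm_estimate(W: torch.Tensor, iters: int = 10) -> torch.Tensor:
    """Estimate the largest singular value of W via power iteration."""
    v = torch.randn(W.shape[1], device=W.device)
    v = v / v.norm()
    for _ in range(iters):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.T @ u
        v = v / (v.norm() + 1e-12)
    # Rayleigh-quotient style estimate of sigma_max
    return u @ (W @ v)

@torch.no_grad()
def project_spectral_norm(W: torch.Tensor, cap: float = 1.0) -> None:
    """Rescale W in place so its spectral norm does not exceed `cap`."""
    sigma = spectral_norm_estimate(W)
    if sigma > cap:
        W.mul_(cap / sigma)

# Usage: after each optimizer step, project every 2-D weight back into the cap.
layer = torch.nn.Linear(256, 256)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x = torch.randn(32, 256)
loss = layer(x).pow(2).mean()
loss.backward()
opt.step()
for p in layer.parameters():
    if p.ndim == 2:
        project_spectral_norm(p, cap=1.0)
```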
Primary Area: optimization
Submission Number: 15356