Keywords: model plasticity, weight decay, language models, hyperparameter optimization
TL;DR: Weight decay improves language model plasticity, and models that perform worse after pretraining can surprisingly perform better after fine-tuning
Abstract: The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity -- the base model's ability to adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key pretraining hyperparameter. Through systematic experiments, we find that models pretrained with stronger weight decay are more plastic, showing larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs: base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention, and reduces overfitting on the training data. In conclusion, this work sheds light on the multifaceted role of a single optimization hyperparameter in shaping model behavior and demonstrates the importance of using metrics beyond the cross-entropy loss for hyperparameter optimization.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Challenge: This submission is an entry to the science of DL improvement challenge.
Submission Number: 40