Keywords: large model, finetuning, efficiency, catastrophic forgetting
Abstract: Finetuning large pretrained neural networks is known to be resource-intensive, both
in terms of memory and computational cost. To mitigate this, a common approach is
to restrict training to a subset of the model parameters. By analyzing the relationship
between gradients and weights during finetuning, we observe a notable pattern:
large gradients are often associated with small-magnitude weights. This correlation
is more pronounced in finetuning settings than in training from scratch. Motivated
by this observation, we propose NANOADAM, which dynamically updates only the
small-magnitude weights during finetuning and offers several practical advantages:
first, the criterion is gradient-free—the parameter subset can be determined without
gradient computation; second, it preserves large-magnitude weights, which are
likely to encode critical features learned during pretraining, thereby reducing the
risk of catastrophic forgetting; third, it permits the use of larger learning rates
and consistently leads to better generalization performance in our experiments. We
demonstrate this for both NLP and vision tasks.
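The sketch below illustrates the idea described in the abstract: an Adam-style optimizer that restricts updates to the small-magnitude weights, selected without any gradient computation. It is not the authors' implementation; the per-tensor thresholding, the `keep_frac` fraction, and the masking-via-gradient-zeroing mechanism are illustrative assumptions.

```python
# Minimal sketch (assumed details, not the authors' code): an Adam variant that
# updates only the small-magnitude entries of each parameter tensor.
import torch


def small_weight_mask(param: torch.Tensor, keep_frac: float = 0.5) -> torch.Tensor:
    # Gradient-free criterion: keep the `keep_frac` fraction of entries with the
    # smallest absolute value within this tensor (per-tensor threshold is an assumption).
    k = max(1, int(keep_frac * param.numel()))
    thresh = param.abs().flatten().kthvalue(k).values
    return param.abs() <= thresh


class MaskedAdam(torch.optim.Adam):
    """Adam that zeroes gradients on large-magnitude weights before stepping."""

    def __init__(self, params, keep_frac: float = 0.5, **kwargs):
        super().__init__(params, **kwargs)
        self.keep_frac = keep_frac

    def step(self, closure=None):
        # Recompute the mask each step ("dynamically updates only the
        # small-magnitude weights"), mask the gradients, then defer to Adam.
        with torch.no_grad():
            for group in self.param_groups:
                for p in group["params"]:
                    if p.grad is not None:
                        mask = small_weight_mask(p, self.keep_frac)
                        p.grad.mul_(mask.to(p.grad.dtype))
        return super().step(closure)
```

Because large-magnitude weights never receive updates under this masking, the pretrained features they encode are left untouched, which is the mechanism the abstract credits for reducing catastrophic forgetting.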
Submission Number: 24