Keywords: Continual learning, model merging, fine-tuning
Abstract: In continual learning with pretrained large language models (LLMs), where data from instruction fine-tuning (IFT) tasks arrives in a sequence, fine-tuning on later tasks often degrades performance on earlier tasks.
This is especially pronounced when the IFT tasks come from diverse domains.
In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the LLM has learned?
Inspired by a classical continual-learning method, an L2 penalty toward previous weights, we propose Sequential Fine-tuning with Averaging (SFA), a method that periodically merges the current model with checkpoints saved from previous tasks during the course of training.
State-of-the-art approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. Our method achieves comparable results without storing past data or keeping multiple copies of the parameters at each gradient step.
Furthermore, our method outperforms penalty-based methods such as an L2 penalty toward previous weights and EWC, as well as common merging techniques such as Task Arithmetic and TIES-Merging.
Finally, we show that with our method, a single model can simultaneously perform well on a range of fine-tuning tasks across diverse domains, including Math, Law, and Code.
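The following is a minimal PyTorch sketch of the checkpoint-averaging idea described above, under one possible reading of the method: the final weights of each task are snapshotted, and during training on the next task the current weights are periodically averaged with that snapshot. The function name `sfa_finetune` and the hyperparameters `k` (averaging interval) and `alpha` (merge coefficient) are illustrative assumptions, not details from the submission.

```python
import torch

def sfa_finetune(model, tasks, optimizer_fn, loss_fn, k=100, alpha=0.5):
    """Fine-tune `model` on a sequence of tasks, periodically averaging
    its weights with the checkpoint saved at the end of the previous task.

    tasks: iterable of data loaders, one per IFT task, in arrival order.
    k, alpha: assumed averaging interval and merge coefficient (illustrative).
    """
    prev_ckpt = None  # no averaging during the first task
    for task_loader in tasks:
        optimizer = optimizer_fn(model.parameters())
        for step, (x, y) in enumerate(task_loader, start=1):
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            # Merge with the previous task's checkpoint every k steps:
            # p <- (1 - alpha) * p + alpha * p_prev
            if prev_ckpt is not None and step % k == 0:
                with torch.no_grad():
                    for name, p in model.named_parameters():
                        p.mul_(1 - alpha).add_(prev_ckpt[name], alpha=alpha)
        # Snapshot the weights at the end of this task for future averaging.
        prev_ckpt = {n: p.detach().clone() for n, p in model.named_parameters()}
    return model
```

With `alpha = 0.5`, each merge is an equal average of the current model and the previous task's checkpoint; in practice both `alpha` and the averaging frequency would presumably be tuned per setting.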
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11350