Abstract: Continual learning for end-to-end automatic speech recognition must contend with a number of difficulties. Fine-tuning strategies tend to lose performance on data they were previously trained on, a phenomenon known as catastrophic forgetting. Adapters can help by allowing easy switching between fine-tuned models, but adapted models lose performance on data from other domains, which is problematic when the domain of the input data is unknown. We propose a solution that reduces forgetting to only 3.4% while exceeding the average performance of models fine-tuned on all available data, an approach whose forgetting rate is 49% even with LoRA. Our experiments on diverse datasets and models show that linearly interpolating the parameters of several models, each fine-tuned from the same generalist model, yields a unified model that performs well on all tested data. Moreover, the same model can be fine-tuned and averaged multiple times while maintaining a low rate of forgetting.
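To make the core idea concrete, the sketch below shows linear interpolation of the parameters of several checkpoints fine-tuned from the same generalist model. This is a minimal illustration only, assuming PyTorch state dicts and uniform interpolation weights; the function name `average_state_dicts` and the weighting scheme are assumptions, not the paper's actual implementation.

```python
# Minimal sketch: average (linearly interpolate) the parameters of several
# models fine-tuned from the same generalist checkpoint.
# Assumptions: all checkpoints share one architecture and are stored as
# PyTorch state dicts; uniform weights are used unless specified.
from typing import Dict, List, Optional

import torch


def average_state_dicts(
    state_dicts: List[Dict[str, torch.Tensor]],
    weights: Optional[List[float]] = None,
) -> Dict[str, torch.Tensor]:
    """Linearly interpolate parameters of models sharing one architecture."""
    if weights is None:
        # Uniform interpolation: each fine-tuned model contributes equally.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    assert abs(sum(weights) - 1.0) < 1e-6, "interpolation weights should sum to 1"

    averaged: Dict[str, torch.Tensor] = {}
    for key in state_dicts[0]:
        # Weighted sum of the same parameter tensor across all checkpoints.
        averaged[key] = sum(
            w * sd[key].float() for w, sd in zip(weights, state_dicts)
        )
    return averaged


# Hypothetical usage: load domain-specific checkpoints fine-tuned from the
# same base, average them, and load the result into the shared architecture.
# unified = average_state_dicts([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(unified)
```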