Keywords: Parameter-efficient fine-tuning, mixture of experts, sparse upcycling
TL;DR: We introduce Mixture of Layer Experts (MoLEx), a model that sparsely upcycles the layers of a pre-trained model to further improve the performance of parameter-efficient fine-tuning (PEFT).
Abstract: Large-scale pre-training of deep models, followed by fine-tuning to adapt them to downstream tasks, is currently the cornerstone of natural language processing (NLP). The massive size of these models has led to remarkable success on many NLP tasks. However, a drawback is the cost of retraining all of the base model's parameters to adapt to each task or domain. Parameter-Efficient Fine-Tuning (PEFT) provides a highly effective solution to this challenge by minimizing the number of trainable parameters while maintaining the quality of the model. In this paper, we study layers as extractors of different types of linguistic information that are valuable when used in conjunction with each other. We then propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model. It performs a conditional computation of a mixture of layers during fine-tuning to provide the model with more structural knowledge about the data. By providing an avenue for information exchange between layers, MoLEx enables the model to make a more well-informed prediction for the downstream task, leading to better fine-tuning results with the same number of effective parameters. As experts can be processed in parallel, MoLEx introduces minimal additional computational overhead. We empirically corroborate the advantages of MoLEx when combined with popular PEFT baseline methods on a variety of downstream fine-tuning tasks, including the popular GLUE benchmark and the End-to-End Challenge (E2E).
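To make the idea of layers as experts concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a hypothetical `MoLExBlock` wraps one frozen pre-trained layer, routes each input to one other layer expert via a small learned gate, and returns a convex combination of the two layer outputs. The names (`MoLExBlock`, `alpha`, the top-1 router) and design details are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MoLExBlock(nn.Module):
    """Illustrative sketch: mix the wrapped layer's output with the output of
    one other pre-trained layer chosen by a learned top-1 router (assumption,
    not the paper's exact formulation)."""

    def __init__(self, layers, layer_idx, hidden_dim):
        super().__init__()
        self.layers = layers          # shared list of frozen pre-trained layers (the "experts")
        self.layer_idx = layer_idx    # index of the layer this block replaces
        self.router = nn.Linear(hidden_dim, len(layers))  # gate over layer experts
        self.alpha = nn.Parameter(torch.tensor(0.5))      # learnable mixing weight (assumed)

    def forward(self, x):
        # Output of the original layer at this depth.
        h_self = self.layers[self.layer_idx](x)

        # Route on the mean token representation and pick the top-1 expert layer per example.
        gate_logits = self.router(x.mean(dim=1))          # (batch, num_layers)
        top1 = gate_logits.argmax(dim=-1)                 # (batch,)

        # The two layer outputs are independent and could run in parallel;
        # a per-example loop is used here only for clarity.
        h_expert = torch.stack(
            [self.layers[idx](x[b : b + 1]).squeeze(0) for b, idx in enumerate(top1.tolist())]
        )

        # Convex combination of the wrapped layer and the routed layer expert.
        return self.alpha * h_self + (1.0 - self.alpha) * h_expert
```

Because the expert layers are reused from the frozen backbone, the only new parameters in this sketch are the router and the mixing weight, which is consistent with the abstract's claim of keeping the number of effective parameters unchanged up to a small gating overhead.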
Submission Number: 29