Abstract: Mixture-of-Experts (MoE) models benefit from a dynamic routing mechanism among their specialized experts, which existing Parameter-Efficient Fine-Tuning (PEFT) strategies fail to leverage. This motivates us to investigate whether adaptation modules themselves should incorporate routing mechanisms to align with MoE's multi-expert architecture. We analyze the dynamics of core components when applying PEFT to MoE language models and examine how different routing strategies affect adaptation effectiveness. Extensive experiments adapting OLMoE-1B-7B and Mixtral-8×7B on various commonsense and math reasoning tasks validate the performance and efficiency of our routed approach. We identify the optimal configurations for different scenarios and provide empirical analyses with practical insights to facilitate better PEFT and MoE applications.
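As context for the abstract's mention of routing-aware adaptation modules, below is a minimal illustrative sketch (PyTorch; the class name RoutedLoRAAdapter and all hyperparameters are hypothetical and not taken from the paper) of the general idea: a small set of low-rank adapter experts whose per-token updates are mixed by a lightweight top-k router, mirroring the MoE backbone's own routing. It is not the paper's exact method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedLoRAAdapter(nn.Module):
    """Hypothetical routed adapter: several low-rank (LoRA-style) experts
    whose outputs are combined by a lightweight top-k router per token."""

    def __init__(self, d_model: int, rank: int = 8, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # One low-rank (A, B) pair per adapter expert; B starts at zero so the
        # adapter initially contributes no update, as in standard LoRA.
        self.lora_A = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, rank, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        logits = self.router(x)                          # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # top-k adapter experts per token
        weights = F.softmax(weights, dim=-1)
        # Low-rank update from every expert: (batch, seq, n_experts, d_model)
        delta = torch.einsum('bsd,edr,erm->bsem', x, self.lora_A, self.lora_B)
        # Gather the chosen experts' updates and mix them with the router weights.
        idx_exp = idx.unsqueeze(-1).expand(*idx.shape, delta.size(-1))
        picked = torch.gather(delta, 2, idx_exp)         # (batch, seq, top_k, d_model)
        return x + (weights.unsqueeze(-1) * picked).sum(dim=2)

# Usage sketch (shapes only):
# adapter = RoutedLoRAAdapter(d_model=1024)
# y = adapter(torch.randn(2, 16, 1024))   # -> (2, 16, 1024)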
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Mixture-of-Experts, parameter-efficient-training, fine-tuning, LLM, efficiency
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: In the Limitations section we discuss potential risks, including bias, broader societal harms, and environmental impact.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: In Section 3 we cited the creators of all major artifacts used, including the two open-source LLMs and all 14 benchmark datasets used in evaluation.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: All models and datasets are released under the Apache 2.0 license, which permits research and educational use consistent with this study's methodology.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: In Section 3 we describe our use of the aforementioned open-source models and datasets, which aligns with their licensed use for research and educational purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: In the Limitations section we acknowledge that the training data may contain demographic and geographic skews inherited from web corpora. We did not implement special procedures to check for PII or offensive content beyond what is standard in the field. We used well-established, publicly available models (OLMoE-1B-7B, Mixtral-8×7B) and benchmark datasets that are widely adopted in NLP research. Following standard practice in parameter-efficient fine-tuning research, we utilized these artifacts as-is for educational and research purposes only. The models and datasets have undergone the data curation processes implemented by their original creators, and our work focuses on methodological improvements to fine-tuning techniques rather than data collection or curation.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: All open-source models and datasets have comprehensive documentation in their corresponding papers, which we cite in Section 3.
B6 Statistics For Data: Yes
B6 Elaboration: Section 3.1 provides relevant statistics, including the Commonsense170K and Math50K training sets, evaluation across 14 tasks, and details of model sizes and parameters.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: In Section 3.1 and Appendix A.1 we report model parameters (active / trainable / total parameters) and computing infrastructure (single NVIDIA A100 GPU for OLMoE, 4×NVIDIA H100 GPUs for Mixtral).
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.1 describes the experimental setup including datasets, baselines, and evaluation methodology. Appendix A.1 and Table 2 provide detailed hyperparameter configurations including learning rates, batch sizes, epochs, and optimizer settings for both models.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 3 and all major results report average performance calculated across individual benchmark test sets (8 commonsense reasoning and 6 arithmetic reasoning tasks), following established evaluation practices in the field.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3.1 mentions the benchmark suites and evaluation frameworks from Hu et al. (2023), and Appendix A.1 specifies the training frameworks used.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 1465